CHAPTER 1


INTRODUCTION 


1.1.  Thesis  Objectives 

The development of realistic models to describe the error behavior of computer systems
is a difficult problem. Although many researchers have addressed the modeling issue
and have significantly advanced the state of the art, there is little or no validation of these
models with field data. It is, therefore, extremely valuable to model the error and recovery
process in a production system using real error data. Apart from providing useful information
on how errors occur, this process also provides insight into the interaction between
various system components. Additionally, it allows explicit modeling of the relationship
between resource usage and hardware and software errors, an area that has yet to be fully
explored.

In this research we build a state-transition model which describes the
resource-usage/error/recovery process of a computer system. This model is based on low-level
error and resource usage data collected on a production system. The data were collected on an
IBM 3081 system during its normal operation. Both the normal and erroneous behavior of
the system are modeled. The results, therefore, provide an understanding of the different
error and recovery processes and their relationship to various types of resource usage.
Hardware and software reliabilities and their interaction are also modeled. Results show
that the error and recovery process on our measured system is best described by a
semi-Markov process.

1.2.  Related  Research 

The primary motivation for this research is that there has been no attempt to explicitly
model the resource-usage/error/recovery process based on real data. The only related
research is that in [1, 2], where the authors proposed the use of a doubly stochastic Poisson
process to model a cyclic load-error relationship. The model assumes that the instantaneous
error rate can be described by a cyclostationary Gaussian process (i.e., the workload has a
cyclic pattern). Thus only the external behavior has been modeled. Furthermore, only a
single workload variable (time spent in the kernel mode) was modeled.


Analytical models for hardware failure have been extensively investigated
[3, 4, 5, 6, 7, 8]. Although the time for different components to fail is usually assumed to be
exponentially distributed, time-dependent failure rates and graceful degradation have been
considered along with performability issues. Repairability has been modeled by Trivedi et
al. [3, 5, 6, 8], all of which assume constant repair times. A job/task flow based model is
described in [9]. Failure occurrence is assumed to be a linear function of the service
requests from a job/task flow. As shown in [10], the assumption of linearity may result in
underestimating the effect of the workload, especially when the load is high.

Most software reliability models address the development, debugging, and
testing phases of the software, as in [11, 12] and [13, 14]. Few of these models have been
applied to the operational phase of the software. In [2] and [15], software failures in an
operating environment are studied. Both studies found that at least 60% of system failures
are software related. Another study [16] shows that undetected software-related errors are
due to specification errors, implementation errors, or logic errors.

There is little explicit study of hardware/software reliability. The
hardware/software interface is generally hard to model, and experimental measurements
are not easy to obtain and analyze. In [15], software failures in the operating system,
which could be related to hardware problems, were analyzed and it was shown that errors
in the hardware/software interface are often fatal. In [17], a methodology for joint
hardware/software model construction and model processing using Stochastic Petri Nets is
described.

With the exception of the software reliability growth models, which have been validated
with real data, there are few, if any, models of software reliability in an operational
environment. Exceptions include the hardware and software model discussed in [18] and a
measurement-based model of workload-dependent failures discussed in [10]. Both, however,
only describe the external behavior of the system and do not provide insight into
component-level behavior.

It is therefore highly instructive to construct a detailed model based on low-level
error data from a production system. Toward this end we have constructed a joint
resource-usage/error/recovery model using error and resource usage data collected from an
IBM system. The model provides detailed information on system behavior under normal
and error conditions. Hardware and software failures of different severity are modeled.
Multiple errors and the effect of on-line recovery routines are also considered.


1.3. Thesis Overview


A methodology for model construction based on real error data and resource usage
information is described in Chapter 2. The model construction includes the resource usage
(workload) characterization, error and recovery characterization, and modeling the overall
system. For the workload characterization, we use a statistical clustering method to
reduce the collected resource usage data of the measured system, an n-tuple variable with a
potentially infinite number of points, to a small number of sets. A state-transition model
of the resource usage of the system is then constructed based on these sets.

Different  types  of  component  errors  and  recovery  procedures  are  also  described  in 
detail  and  classified  in  Chapter  2.  A  two-level  error  data  reduction  scheme  is  employed  to 
identify  individual  error  incidents  and  ensure  that  the  analysis  is  not  biased  by  error 
records  relating  to  the  same  problem.  The  interaction  of  hardware  and  software  errors  is 
modeled  in  this  chapter.  The  three  models  describing  resource  usage,  error  and  recovery  are 
then  combined  to  form  an  overall  model.  The  conditional  transition  probabilities  as  well 
as  the  sojourn  times  of  states  are  estimated  from  real  data.  Results  show  that  the 
resource-usage/error/recovery  process  is  a  semi-Markov  process. 

In Chapter 3 we perform four different kinds of model analyses to show the characteristics
of the measured system. First, we use the model built in Chapter 2 to evaluate
key characteristics of the system, such as the state occupancy probability and the
unconditional transition probability from one specified state to another. These measures
provide a fair estimate of the model behavior. Second, we estimate the error probability
due to the workload from the model. The analysis shows that the error probabilities
appear to be not only a function of the resource usage, but also related to the length of
the sojourn time in a resource usage state. Third, the model is validated by
comparing the results predicted from the model with the values estimated from the actual
observations. Finally, we perform an analysis to investigate the significance of using a
semi-Markov process, as opposed to a Markov process, to model the measured system.

In  Chapter  4  a  measurement-based  software  reliability  model  is  built.  In  addition  to 
describing  the  software  error  and  recovery  process  in  the  measured  system  this  model  also 
provides  a  quantification  of  software  system  error  characteristics  and  the  interaction 
between  different  types  of  software  errors. 

A performability model based on real data is proposed in Chapter 5. A reward function,
based on the service rate and the error rate in each state, is defined in order to estimate
the performability of the measured system and to depict the cost of different error types
and recovery procedures. The conversion of a semi-Markov model to its Markov version is
also demonstrated in this chapter. This conversion gives us the ability to use an existing
system performance estimator to estimate the performability of the measured system.

In  Chapter  6  we  provide  a  summary  of  this  research  and  highlight  some  important 
conclusions  drawn  from  this  work. 


CHAPTER  2 


RESOURCE-USAGE/ERROR/RECOVERY MODELING


2.1.  Workload  Modeling 

In this section we build a state-transition model to describe the variation in system
activity. It will later be shown that this approach allows an error to be considered as a
transition from normal activity. System activity is characterized by a number of resource
usage parameters. A statistical clustering technique is employed to reduce the potential
many-to-many transitions of the workload vector to a small number of states representative
of those found in the data. The data for our studies came from an IBM 3081 system
running  the  MVS  operating  system.  The  system  consists  of  dual  processors  with  two 
time-multiplexed  channel  sets.  Together  these  two  sets  allow  a  maximum  of  24  abchan- 
nels  to  be  simultaneously  active  in  each  I/O  cycle. 



2.1.1.  Resource  Usage  Characterization 

The workload data was collected using the IBM MVS/370 system Resource Management
Facility (RMF) [19]. RMF is a flexible tool for measuring the performance of an IBM
system. It measures data in two ways: by exact count and by sampling. The exact count
method checks the appropriate system indicators at the beginning and the end of an interval
and calculates the difference. The sampling method checks the appropriate system indicators
at each cycle within an interval (e.g., an interval may be one hour and a cycle may be
500 milliseconds). At the end of the interval the mass of data collected at each cycle is
reduced to either minimum, maximum, and average values or to a percentage value. The
results presented here are based on three months of sampled RMF data, with a cycle time of
500 milliseconds and an interval of one hour.
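The reduction step of the sampling method can be sketched as follows. This is only an illustration of the idea described above, not RMF's actual implementation; the function names `reduce_interval` and `busy_fraction` are hypothetical.

```python
from statistics import mean

# Cycle and interval lengths used for the data in this study.
CYCLE_MS = 500
INTERVAL_MS = 60 * 60 * 1000
CYCLES_PER_INTERVAL = INTERVAL_MS // CYCLE_MS  # samples taken per interval


def reduce_interval(samples):
    """Reduce the per-cycle samples of one interval to (min, max, average)."""
    if not samples:
        raise ValueError("an interval needs at least one sample")
    return min(samples), max(samples), mean(samples)


def busy_fraction(flags):
    """Reduce per-cycle busy/idle indicators to a fraction of the interval."""
    if not flags:
        raise ValueError("an interval needs at least one sample")
    return sum(1 for flag in flags if flag) / len(flags)
```

With a 500 millisecond cycle and a one hour interval, each reported value summarizes 7200 samples.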

Four  different  resource  usage  measures  were  selected  to  represent  the  workload  of 
three  basic  components  of  the  computer  hardware  system. 

CPU - fraction of the measured interval for which the CPU is executing instructions

CHB - fraction of the measured interval for which the channel was busy and the
CPU was in the wait state (this parameter is usually used to measure the degree
of contention in our system)

SIO - number of successful Start I/O and Resume I/O instructions issued to the
channel

DASD - number of requests serviced on the direct access storage devices

Although several other measures were available, we decided to use only the measures listed
above so as to keep the model tractable. The methodology presented here is easily extended
to incorporate other measures.

2.1.2.  Workload  Clustering 

At any interval of time the measured workload is represented by a point in
4-dimensional space, (CPU, CHB, SIO, DASD). Cluster analysis is used to divide the workload
into similar classes according to a pre-defined criterion.¹ This allows us to concisely
describe the dynamics of system behavior and extract a structure that already exists in the
workload data.² Each cluster (defined by its centroid) is then used to depict a system state,
and a state-transition diagram (consisting of inter-cluster transition probabilities and
cluster sojourn times) is developed.

¹ Potentially, we can have an uncountably large number of points in the workload space. Intuitively, only a
countable number of combinations of the four measures do in fact occur. Further, it is seen that they usually
occur in clusters.

A k-means clustering algorithm [21, 22] was used for cluster analysis. Briefly, the
algorithm partitions an n-dimensional population into k sets on the basis of a sample. It
starts with k groups, each of which consists of a single random point. Each new point is
added to the group with the closest centroid. After a point is added to a group, the mean of
that group is adjusted in order to take the new point into account. This process is repeated
until the changes in the cluster means become negligibly small. Thus at each stage the k
means are, in fact, the means of the groups they represent. Therefore, k non-empty clusters,
$C_1, C_2, \ldots, C_k$, are sought such that the sum of the squares of the Euclidean distances of
the cluster members from their centroids is minimized, i.e.,

$$\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_j \rVert^2 \;\rightarrow\; \text{minimum}$$

where $x_i \in C_j$ and $\bar{x}_j$ is the centroid of cluster $C_j$.
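The clustering criterion can be illustrated with a minimal sketch. It uses a batch variant of k-means (all points are reassigned on every pass) rather than the point-by-point update described above, and it seeds the clusters with the first k points instead of random ones so that the result is reproducible. Both choices, and the names `kmeans` and `within_cluster_ss`, are assumptions made for illustration; this is not the implementation of [21, 22].

```python
import math


def kmeans(points, k, max_iter=100, tol=1e-9):
    """Batch k-means on a list of equal-length tuples; returns (centroids, clusters)."""
    centroids = [tuple(p) for p in points[:k]]  # assumed seeding: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's seed
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters


def within_cluster_ss(centroids, clusters):
    """Sum of squared Euclidean distances of members from their centroids."""
    return sum(
        math.dist(p, c) ** 2 for c, members in zip(centroids, clusters) for p in members
    )
```

For example, four hypothetical (CPU, CHB) observations such as `[(0.10, 0.50), (0.12, 0.55), (0.95, 0.02), (0.97, 0.01)]` with `k=2` separate into a low-CPU and a high-CPU cluster.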

Two  types  of  workload  clusters  were  formed.  In  the  first  case  CPU  and  CHB  were 
selected  to  be  the  workload  variables.  This  combination  was  found  to  best  describe  the 
CPU-bound  load  (nearly  60%  of  the  observations  have  a  CPU  usage  greater  than  0.72).  In 
the  second  case  the  clusters  were  formed  considering  SIO  and  DASD  as  workload  variables. 
This  combination  was  found  to  best  describe  the  I/O  workload.  Table  2.1  shows  the  results 
for  these  two  cases. 

² Similar clustering techniques are also used for workload characterization in [20].

Table 2.1. Characteristics of workload clusters

(a) CPU workload

Cluster   % of     Mean      Mean      Std dev   Std dev
id        obs      of CPU    of CHB    of CPU    of CHB
W1         7.44    0.0981    [?]       [?]       [?]
W2         0.50    0.1126    0.5525    [?]       [?]
W3         2.73    0.1547    [?]       [?]       0.0755
W4        12.41    0.3105    0.1637    0.0550    0.0459
W5         0.74    0.3639    0.3819    0.0365    0.1923
W6        17.12    0.5416    0.1287    0.0560    0.0511
W7        22.58    0.7207    0.0848    0.0576    0.0301
W8        36.48    0.9612    0.0168    0.0362    0.0143

R² of CPU = 0.9724
R² of CHB = 0.8095
overall R² = 0.9604

(b) I/O workload

Cluster   % of     Mean      Mean      Std dev   Std dev
id        obs      of SIO    of DASD   of SIO    of DASD
U1         8.89     16.80     0.95      6.80     1.30
U2        36.05     41.59     2.99      7.51     1.92
U3         1.48     44.37    20.62      8.55     4.18
U4         1.73     60.07    38.84      6.77     8.42
U5        42.72     67.34     5.19      7.92     3.72
U6         0.49     87.30    31.19      3.87     9.84
U7         7.9      96.20     6.02      8.73     3.34
U8         0.74    141.10    10.10     10.28     8.50

R² of SIO = 0.8861
R² of DASD = 0.7176
overall R² = 0.8751

An examination of Table 2.1 also shows the dynamics of the measured system
behavior. We see in Table 2.1(a) that about 36% of the time the CPU is highly loaded
(0.96) and almost 76% of the time the CPU load is above 0.5. Since the measured system is
a two-processor machine, we may say that 76% of the time at least one of the processors is
busy. Note that, with increasing CPU usage, CHB (CPU wait and channel busy) decreases.
This indicates that resource contention is not a problem in our measured system. In Table
2.1(b) (the I/O load), clusters U2 and U3 have very similar channel start I/O rates (SIO),
but the disk service rate (DASD) of U3 is as much as 10 times that of U2. This indicates
that some I/O requests result in a burst of data while the others transfer only a few words. A
burst transfer, however, occurred only 4% of the time (U3 + U4 + U6). This result may be
due to the fact that our measurements were made during work hours, whereas I/O-bound jobs
are normally executed during off-work hours.

2.1.3. Resource Usage Model

State-transition diagrams of these two different types of workload clusters are shown
in Figure 2.1 and Figure 2.2. The transition probability from state i to state j, $p_{ij}$, is
estimated from the measured data using:

$$p_{ij} = \frac{\text{observed number of transitions from state } i \text{ to state } j}{\text{observed number of transitions from state } i} \qquad (2.1.1)$$
These two figures provide us with not only the details of workload dynamics but also the
interactions among clusters. Figure 2.1 shows that once the CPU load reached 0.5 (W6), the
transition of the greatest probability was to its next higher load (W7) and the transition to
its next lower load (W4) occurred with the second greatest probability. This can be seen
in states W6, W7, and W8. However, when the CPU load is low (i.e., less than 0.5), the
change to a higher load is much faster. For example, the CPU load changed from W3 to W4
with probability 0.333 and from W4 to W7 with probability 0.424.
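The ratio estimator for the transition probabilities can be sketched in a few lines. The sketch assumes the input is the time-ordered sequence of cluster labels, one per measured interval, and that consecutive intervals in the same cluster count as one visit (a sojourn) rather than as self-transitions; that treatment and the function name `transition_probabilities` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import groupby


def transition_probabilities(state_sequence):
    """Estimate p_ij = (# transitions i -> j) / (# transitions out of i)."""
    # Collapse runs of the same state into a single visit.
    visits = [state for state, _ in groupby(state_sequence)]
    counts = defaultdict(Counter)
    for src, dst in zip(visits, visits[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(dests.values()) for dst, n in dests.items()}
        for src, dests in counts.items()
    }
```

For a hypothetical sequence `["W6", "W6", "W7", "W8", "W7", "W6", "W7"]`, state W7 is left twice, once to W8 and once to W6, so both estimated probabilities are 0.5.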
2.2. Error Modeling

In this section the collection and characterization of errors is discussed. A
state-transition diagram to describe different error states is developed. The measured system
incorporates built-in error detection facilities, and many components also provide for
recovery through retry or redundancy. The error and recovery information is logged into a
permanent data set called LOGREC [23]. For each error, whether recoverable or not, the
operating system creates a time-stamped record describing the error and providing relevant
information on the state of the machine. In each record there are a number of bits describing
the type of error, its severity, and the result of hardware and software attempts to
recover from the problem. From these data, six different types of errors were collected:


(1) CPU-related errors - those that affect the normal operation of the CPU; the errors
may originate in the CPU itself, in the main memory, or in a channel,

(2) Temporary channel errors - those that are recovered by channel retry and do not
result in the termination of the channel control program,

(3) Temporary (soft) disk errors - those I/O errors that are recovered by correcting the
data or by retrying the hardware instruction,

(4) Temporary (hard) disk errors - those I/O errors that are recovered by software
instruction retry or by a functional recovery routine(s),

(5) Permanent disk errors - those I/O errors that are not correctable and cannot
be recovered by retrying the operation, and

(6) Software errors - software incidents that are due to invalid supervisor calls,
program checks, and other software exception conditions.
2.2.1. Error Clustering

Due  to  the  manner  in  which  errors  are  detected  and  reported  in  a  computer  system,  it 
is  possible  that  a  single  fault  may  manifest  itself  as  more  than  one  error,  depending  on  the 
activity  at  the  time  of  the  error.  The  different  manifestations  may  not  all  be  identical  [24]. 
The  system  recovery  usually  treats  these  errors  as  isolated  incidents.  In  order  to  address 
this  problem  and  to  ensure  that  the  analysis  is  not  biased  by  error  records  relating  to  the 
same  problem,  two  levels  of  data  reduction  were  performed. 

First, a coalescing algorithm described in [10] was used to analyze the data and merge
observations which occur in rapid succession and relate to the same problem. Next, a
technique described in [24] to automatically group records most likely to have a common cause
was used (see Appendix A for the details).³ By using these two methods, we classified
errors into five different classes. These classes are called error events since they may
contain more than one error and are defined as follows.


CPU : events that caused errors to be logged as CPU-related errors

CHAN : events that caused errors to be logged as channel errors

SWE : events that caused errors to be logged as software errors

DASD : events that caused errors to be logged as direct access storage device errors

MULT : events that caused errors affecting more than one type of component
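The first reduction step, merging records that occur in rapid succession, can be sketched as follows. The text does not reproduce the algorithm of [10], so this is only a plausible illustration: the 300 second window, the record layout, and the names `coalesce` and `classify` are all assumptions. The second step (grouping by likely common cause, as in [24]) is not modeled here; an event is simply labeled MULT when its records span more than one error type.

```python
def coalesce(records, window_seconds=300):
    """Merge time-stamped error records into events.

    records: iterable of (timestamp_in_seconds, error_type) pairs.
    A record joins the current event if it follows the event's latest
    record by at most window_seconds (an assumed threshold).
    """
    events = []
    for timestamp, error_type in sorted(records):
        if events and timestamp - events[-1]["end"] <= window_seconds:
            event = events[-1]
            event["end"] = timestamp
            event["types"].add(error_type)
            event["count"] += 1
        else:
            events.append(
                {"start": timestamp, "end": timestamp, "types": {error_type}, "count": 1}
            )
    return events


def classify(event):
    """Label an event by its single error type, or MULT if several types occur."""
    if len(event["types"]) == 1:
        return next(iter(event["types"]))
    return "MULT"
```

The difference `event["end"] - event["start"]` then gives the duration of an error event, measured from the first to the last detected error of that event.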

Table 2.2 lists the frequencies of different types of errors. The table shows that
about 80% of the errors are disk and software errors. We also note that about 17% of the errors
are classified as multiple errors (MULT). A MULT error is mostly due to a single cause, but
the fault has non-identical manifestations provoked by different types of system activity.
Since the manifestations are non-identical, recovery may be complex and hence imposes
considerable overhead on the system. It should be noted that such an error event (17% of
our data) has not been modeled before.

³ Although this second reduction is not essential to this work, it allows us to notice several multiple errors
which otherwise would not have been noticed.


2.3. Recovery Modeling

When an error is detected in the measured system, an appropriate recovery routine is
invoked depending on the severity of the error. The recovery procedures were divided into
four categories in increasing order of recovery cost. The recovery cost was measured in
terms of the system overhead required to handle an error. The lowest level (hardware
recovery) involves the use of an error correction code (ECC) or hardware instruction retry
and has minimal overhead. If hardware recovery is not possible (or is unsuccessful), the next
level, i.e., software-controlled recovery, is invoked. This could be simple, e.g., terminating
the current program or task in control, or complex, e.g., invoking a specially designed
recovery routine(s) to handle the problem. The third level of recovery (ALT) involves
transferring the tasks to a functioning processor(s) when one of the processors experiences
an unrecoverable error. If no on-line recovery is possible, the system is brought down for
off-line repair. Figure 2.3 shows a flow chart of the recovery process. Table 2.3 lists the
distribution of recovery levels. From Table 2.3 we note that about 73% of errors were
successfully handled through hardware recovery and most of the others were recovered
by use of the software recovery procedure.

Table 2.2. Frequency of errors

Type of error   Frequency   Percent
CPU                     2      0.04
CHAN                  119      2.23
MULT                  924     17.33
SWE                  1923     36.07
DASD                 2364     44.34
total                5332

[Figure 2.3. Flow chart of the recovery process]


2.4. Resource-Usage/Error/Recovery Model

In this section we combine the separate workload, error, and recovery models
developed so far into a single model, shown in Figure 2.4. A null state W0 is added to
represent the state during the non-measured period, although it is not shown in Figure 2.4.
The transition probabilities among states are estimated from the measured data using
Equation 2.1.1. Notice that, unlike other models, this model describes both the normal and
erroneous behavior of the system. The model has three different classes of states: normal
operation states (S_N), error states (S_E), and recovery states (S_R). Note that a normal state
has two different types of transitions: the first to other normal states, and the second to
error states.

Figure 2.4. State-transition diagram of resource-usage/error/recovery model

Under normal conditions, the system makes transitions from one workload state to
another. The occurrence of an error results in a transition to one of the error states. The
system then goes into one or more recovery modes after which, with a high probability, it
returns to one of the "good" workload states. The state-transition diagram of Figure 2.4
shows that nearly 98.3% of the hardware recovery requests and 99.7% of the software
recovery requests are successful. Thus the error detection, fault isolation, and on-line
recovery mechanisms allow the measured system to handle an error efficiently and
effectively. In less than 1% of the cases is the system not able to recover.

Figure 2.5 shows the state-transition diagram of a MULT error (a MULT event), i.e.,
given that a multiple error has occurred. The model shows that disk and software errors
are strongly correlated in multiple errors. From the diagram, it is seen that in about 65% of
the cases a multiple error starts as a software error (SWE) and in 32% of the cases it starts
as a disk error (DASD). Given that a disk error has occurred, there is nearly a 30% chance
that a software error will follow. It is also interesting to note that there is a 64% chance
that one software error will be followed by another, different software error.

Figure 2.5. State-transition diagram for multiple errors (MULT)


2.4.1. Waiting and Holding Time Distributions

We used the state-transition diagram to show the relationship among the workload,
error, and recovery processes in the measured system. We also showed the interactions
among the errors. In this subsection we present the characteristics of the measured
system in terms of the state waiting and holding times.

The waiting time for state i is the time that the process spends in state i before making
a transition. The holding time for a transition from state i to state j is the time that
the process spends in state i before making a transition to state j [25]. Table 2.4 shows the
mean waiting times of both the workload and error states. It is well known that the mean
and standard deviation of an exponential distribution are the same. Thus an examination of
the mean and standard deviation of the waiting times in Table 2.4 appears to indicate that
not all waiting times are simple exponentials. This is particularly pronounced in Table
2.4(c), which refers to the error states.
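The comparison of mean and standard deviation can be made concrete with the coefficient of variation (standard deviation divided by mean), which equals 1 for an exponential distribution. The sketch below applies it to the error-state values as read from Table 2.4(c); the dictionary layout and function name are illustrative only.

```python
# (mean waiting time, standard deviation) in seconds, as read from Table 2.4(c).
ERROR_STATES = {
    "CHAN": (5.08, 18.31),
    "SWE": (41.35, 103.35),
    "DASD": (120.86, 223.89),
    "MULT": (293.28, 262.84),
}


def coefficient_of_variation(mean, std_dev):
    """Return std_dev / mean; the value is 1 for an exponential distribution."""
    return std_dev / mean


cv = {state: coefficient_of_variation(m, s) for state, (m, s) in ERROR_STATES.items()}
for state, value in cv.items():
    print(f"{state}: coefficient of variation = {value:.2f}")
```

Values well above 1 (CHAN, SWE, DASD) point to more variability than a single exponential allows, while a value below 1 (MULT) points to less.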

Figure 2.6 shows the densities of waiting and holding times for one of the CPU load
states, W8 (see Appendix B for all states). Figure 2.6(a) shows the waiting time for W8,
and Figures 2.6(b) and 2.6(c) represent the holding times from state W8 to the DASD and SWE
error states. These densities are fitted to phase-type exponential density functions [26],

$$f(t) = \sum_{i=1}^{n} a_i \, g_i(t)$$

where $a_i \ge 0$, $\sum_{i=1}^{n} a_i = 1$, and n is the number of phases. The $g_i(t)$ function can be a simple
exponential, a multi-stage hyperexponential, or a multi-stage hypoexponential density
function. The definitions of these three types of exponential functions are listed below.

(1) Exponential: $g(t) = \lambda e^{-\lambda t}$.

(2) Hyperexponential: $g(t) = \sum_i a_i \lambda_i e^{-\lambda_i t}$, where $\lambda_i > 0$, $a_i \ge 0$, and $\sum_i a_i = 1$.

(3) Hypoexponential: $g(t) = \sum_i a_i \lambda_i e^{-\lambda_i t}$, where $\lambda_i > 0$, $\lambda_i \ne \lambda_j$ if $i \ne j$, and
$a_i = \prod_{j \ne i} \lambda_j / (\lambda_j - \lambda_i)$.
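The hyperexponential and hypoexponential densities can be evaluated with a short sketch. The hypoexponential coefficients use the standard form $a_i = \prod_{j \ne i} \lambda_j / (\lambda_j - \lambda_i)$; the rates in the usage note are hypothetical and are not the fitted values of this study. A simple trapezoid rule is included to check that a density integrates to 1.

```python
import math


def hyperexponential_pdf(t, weights, rates):
    """Mixture of exponentials: sum of a_i * lambda_i * exp(-lambda_i * t)."""
    return sum(a * lam * math.exp(-lam * t) for a, lam in zip(weights, rates))


def hypoexponential_pdf(t, rates):
    """Density of a sum of independent exponential stages with distinct rates."""
    total = 0.0
    for i, lam_i in enumerate(rates):
        a_i = 1.0
        for j, lam_j in enumerate(rates):
            if j != i:
                a_i *= lam_j / (lam_j - lam_i)
        total += a_i * lam_i * math.exp(-lam_i * t)
    return total


def trapezoid(f, upper, steps=50000):
    """Integrate f over [0, upper] with the composite trapezoid rule."""
    h = upper / steps
    acc = 0.5 * (f(0.0) + f(upper))
    for n in range(1, steps):
        acc += f(n * h)
    return acc * h
```

For hypothetical stage rates (0.004, 0.03) per second, `trapezoid(lambda t: hypoexponential_pdf(t, (0.004, 0.03)), 5000)` is close to 1, and the mean of the density is 1/0.004 + 1/0.03, about 283 seconds.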


Table 2.4. Mean waiting time (in seconds) of states

(a) CPU bound workload states

State   # of obs   Mean waiting time   Standard deviation   Std error of mean
W1      [?]        1263.71             1384.20              190.13
W2      [?]         289.65                1.19                0.84
W3      [?]         698.79              913.30              204.22
W4      [?]        1203.05             1130.28               99.13
W5      [?]         613.74              421.73              127.16
W6      [?]        1380.86             1588.76              131.04
W7      [?]        1071.31             1004.46               61.36
W8      [?]        1612.72             2576.35              157.97

(b) I/O workload states

State   # of obs   Mean waiting time   Standard deviation   Std error of mean
U1       45        1221.75             1475.70              219.98
U2      316        1453.19             1530.75               86.11
U3       12        1437.15             1452.86              419.40
U4       18        1137.41              616.67              145.39
U5      420        1243.63             1550.49               75.66
U6        4        1696.85             1540.19              770.10
U7       86         937.45             1127.57              121.59
U8        9         387.74              176.96               58.99

(c) Error states

State   # of obs   Mean waiting time   Standard deviation   Std error of mean
CHAN    [?]           5.08               18.31                5.08
SWE     [?]          41.35              103.35                7.29
DASD    401         120.86              223.89               11.18
MULT     77         293.28              262.84               29.95

[Figure 2.6. Densities of waiting and holding times; horizontal axis: duration (minutes)]
Thus the graphs in Figure 2.6 were fitted to the following functions (tested by using the
Kolmogorov-Smirnov test [26] at the 0.01 significance level).

(1) waiting time: $f(t) = 0.000146\,e^{-0.002t} + 0.000939\,e^{-0.00103t} + 0.000033\,e^{-0.0002102t}$

(2) to a DASD error: $f(t) = 0.00094\,e^{-0.004t} + 0.0008355\,(e^{-0.000937t} - e^{-0.00[?]39t})$

(3) to a SWE error: $f(t) = 0.00085\,e^{-0.0[??]t} + 0.000701\,(e^{-0.000716t} - e^{-0.0[?]6[??]t})$
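A goodness-of-fit check of this kind can be sketched as follows. The statistic is the usual Kolmogorov-Smirnov distance between the empirical distribution and a candidate distribution function; the critical value uses the common large-sample approximation 1.628 / sqrt(n) for the 0.01 level, which is an assumption here and may differ from the tables used in [26]. The function names are illustrative.

```python
import math


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and the candidate cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_critical_value_001(n):
    """Approximate 0.01-level critical value for large n (assumed asymptotic form)."""
    return 1.628 / math.sqrt(n)


def fit_is_accepted(sample, cdf):
    """True when the K-S statistic does not exceed the critical value."""
    return ks_statistic(sample, cdf) <= ks_critical_value_001(len(sample))


def exponential_cdf(rate):
    """Return the distribution function of an exponential with the given rate."""
    return lambda t: 1.0 - math.exp(-rate * t)
```

A fitted phase-type density would be checked the same way by passing its distribution function (the integral of the density) in place of `exponential_cdf(rate)`.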

2.4.2.  Recovery  Distributions 

In  our  data,  the  selection  of  the  destinations  from  any  state  of  Sg  was  found  to  be 
independent  of  the  holding  time  distribution.  Further,  for  our  system  the  time  taken  for 
each type of recovery can reasonably be considered constant. The overall recovery time,
i.e., the duration of an error event (or the holding time in an error state), however, was not
constant since an error event may involve more than one recovery attempt. This time is
computed as the time difference between the first detected error and the last detected error
caused by the same event. The duration of an error event can be used to measure the
effectiveness of recovery from this event and also the severity of the error. Figure 2.7 shows
examples of error duration densities for three different types of errors. Again, the following
phase-type exponential densities were fitted to the graphs shown in Figure 2.7 (tested at
the 0.01 significance level).

(1) DASD:  fit)  -  0.0375*  “°151  +  0.007* +  0.008635*^  0145621 

+  0.000186 1«“°  00213771 

(2)  SWE:  /(f)  -  0.041 18 1*"0  0445181  +  0.0002 704*  "°  00360751 
(3)  MULT:  fit)  -  0.00437 l(e~0■003il'’,  -  e*0030109* ) 


2.5.  Summary

In  summary,  we  have  developed  a  state-transition  model  which  describes  the  normal 
and  error  behavior  of  the  system.  Some  key  characteristics  of  the  model  are: 

(1)  workload  dynamics  are  explicitly  described. 

(2)  error/recovery  is  explicitly  described. 

(3)  waiting times in some workload states and in most error states cannot be modeled as
simple exponentials, and

(4)  the  holding  times  from  a  given  workload  state  to  various  error  states  are  dependent 
on  the  destinations. 

Thus,  the  resource-usage/error/recovery  process  is  modeled  as  a  complex  irreducible  semi- 
Markov  process  with  the  state  OFFL  as  recurrent,  making  the  overall  model  ergodic. 
Furthermore,  the  process  is  not  an  independent  semi-Markov  process  since  the  waiting  and 
holding  time  distributions  are  distinct  for  some  states. 
CHAPTER  3


MODEL  ANALYSIS 


Now that we have an overall model, we show how it can be used to predict key
system characteristics. The mean time between different types of errors is evaluated, along
with model characteristics such as the occupancy probabilities of key error and workload
states. Since the normal state transitions are also available, we can explicitly examine those
states which are crucial from an error viewpoint. Before the model behavior can be evaluated,
however, the model parameters have to be defined and the measures derived. Thus in the
next section we provide the definitions of the model parameters and the derivations of some
important measures.


3.1.  Model  Parameters 


From Chapter 2 we know that the measured system is best modeled as a semi-Markov
process. Assume that M is an n-state semi-Markov model. Given a stochastic transition
probability matrix $P = [p_{ij}]$, $p_{ij} \ge 0$, $i = 1,2,\ldots,n$, $j = 1,2,\ldots,n$,
$\sum_{j=1}^{n} p_{ij} = 1$, and a holding time density function matrix $H(t) = [h_{ij}(t)]$,
$t \in (0,\infty)$, the mean holding time of the process staying in state i before making a
transition to state j, $\bar{\tau}_{ij}$, is

$$\bar{\tau}_{ij} = \int_0^\infty t\, h_{ij}(t)\, dt . \qquad (3.1.1)$$


We mentioned in Section 2.4.1 that the waiting time for state i is the time that the process
spends in state i before making a transition. Thus, a waiting time is merely a holding time
that is unconditional on the destination state. Hence the mean waiting time $\bar{\tau}_i$ is
related to the mean holding time by

$$\bar{\tau}_i = \sum_{j=1}^{n} p_{ij}\, \bar{\tau}_{ij} . \qquad (3.1.2)$$

Suppose that a process has been operating unobserved for a long time. Given that the
process is now making a transition, the probability that the transition is to state j, $\pi_j$,
must satisfy the n simultaneous equations

$$\pi_j = \sum_{i=1}^{n} \pi_i\, p_{ij} . \qquad (3.1.3)$$

We note that these n equations are linearly dependent. This linear dependency can be easily
shown by summing these n equations, which results in 1 = 1. Therefore no unique solution
for $\pi_j$ can be obtained just by solving Equations 3.1.3. However, we know that the
probabilities of the transitions to all states have to sum to one, i.e.,

$$\sum_{j=1}^{n} \pi_j = 1 . \qquad (3.1.4)$$

We can therefore use Equation 3.1.3 in conjunction with Equation 3.1.4 to provide a unique
solution for the steady state transition probability. Adding 1 to the left hand side of
Equation 3.1.3 and, by Equation 3.1.4, the equal quantity $\sum_i \pi_i$ to the right hand
side, we have

$$1 + \pi_j = \sum_{i=1}^{n} \pi_i\, (1 + p_{ij}) . \qquad (3.1.5)$$

The unique solution for $\pi_i$ can be obtained by solving the n linear equations of (3.1.5). The
matrix form of the solution is

$$\pi = O\, [U - I + P]^{-1} , \qquad (3.1.6)$$

where $\pi$, O, U and I are:

(1)  $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$,

(2)  O is a unit row vector, i.e., all elements are one,

(3)  U is a unit matrix, i.e., all elements are one, and

(4)  I is an identity matrix.
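
As a concrete illustration of Equation 3.1.6, the following Python sketch solves for the steady state transition probabilities of a small chain. The 3-state matrix `P` and the helper names `solve_linear` and `steady_state` are assumptions made for illustration; they are not taken from the measured system.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                for k in range(col, n + 1):
                    m[row][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def steady_state(p):
    """Return pi satisfying pi (U - I + P) = O, i.e. Equation 3.1.6."""
    n = len(p)
    # Entry (i, j) of U - I + P is 1 - delta_ij + p_ij.
    m = [[1.0 - (1.0 if i == j else 0.0) + p[i][j] for j in range(n)]
         for i in range(n)]
    # pi M = O is equivalent to M^T pi^T = O^T.
    m_t = [[m[i][j] for i in range(n)] for j in range(n)]
    return solve_linear(m_t, [1.0] * n)


# Hypothetical 3-state transition matrix (not taken from the measured data).
P = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.9, 0.1, 0.0]]
pi = steady_state(P)
print([round(x, 4) for x in pi])
```

The result can be checked against Equations 3.1.3 and 3.1.4 directly: the vector should be unchanged by one more transition and its elements should sum to one.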

After  deriving  the  steady  state  probability  (also  called  limiting  state  probability),  the  pro¬ 
bability  of  a  state  being  occupied  by  the  process  and  the  probability  of  the  process  entering 
a  specified  state  can  be  obtained  accordingly. 

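The steady state occupancy probability of state j, denoted as $\Phi_j$, is the probability
that the process occupies state j when the system reaches a stable stage, and is evaluated as

$$\Phi_j = \frac{\pi_j\, \bar{\tau}_j}{\sum_{i=1}^{n} \pi_i\, \bar{\tau}_i} = \frac{\pi_j\, \bar{\tau}_j}{\bar{\tau}} , \qquad (3.1.7)$$

where $\bar{\tau} = \sum_{i=1}^{n} \pi_i\, \bar{\tau}_i$.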


We are sometimes interested not only in the probability that the process will occupy a
state at some time in the future but also in the probability that the process will enter a
state at some particular future time. The probability that the process is just entering
state j at some time instant after the system is in the steady state, $e_j$, is just the state
occupancy probability divided by its mean waiting time [25]:

$$e_j = \frac{\Phi_j}{\bar{\tau}_j} . \qquad (3.1.8)$$

After substituting the result of Equation 3.1.7 into Equation 3.1.8 we have

$$e_j = \frac{\pi_j}{\bar{\tau}} . \qquad (3.1.9)$$
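
The occupancy and entry measures follow mechanically from the steady state transition probabilities and the mean waiting times. The short Python sketch below shows the calculation; the three-state values and the function name `occupancy_and_entry` are hypothetical and serve only to illustrate the two formulas.

```python
def occupancy_and_entry(pi, waiting):
    """Return (phi, e): occupancy probabilities and entry rates.

    pi      -- steady state transition probabilities of the embedded chain
    waiting -- mean waiting time of each state
    """
    # Mean time per transition: sum of pi_i times mean waiting time of i.
    tau_bar = sum(p * w for p, w in zip(pi, waiting))
    phi = [p * w / tau_bar for p, w in zip(pi, waiting)]  # occupancy probability
    e = [p / tau_bar for p in pi]                         # occupancy / waiting time
    return phi, e


# Hypothetical values for a 3-state model (waiting times in seconds).
pi = [0.5, 0.3, 0.2]
waiting = [100.0, 20.0, 400.0]
phi, e = occupancy_and_entry(pi, waiting)
print([round(x, 4) for x in phi])
```

In this example the third state is entered least often but is occupied most of the time, because its mean waiting time is much longer than the others.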

In a semi-Markov process, as in every life, an important question is "How long
does it take to get from here to there?". Assume that the time it takes to reach state j for
the first time if the system is in state i at time zero is $\Theta_{ij}$; then $f_{ij}(t)$, the
probability that $\Theta_{ij} = t$, is defined as [25]:

$$f_{ij}(t) = \mathrm{Prob}(\Theta_{ij} = t) = \sum_{r=1,\, r \ne j}^{n} p_{ir} \int_0^t h_{ir}(\sigma)\, f_{rj}(t-\sigma)\, d\sigma + p_{ij}\, h_{ij}(t) . \qquad (3.1.10)$$

The process can make transitions to other states before it first reaches state j at time t, or it
may stay in state i and then make a direct transition to state j at time t. The first term of
the right hand side of Equation 3.1.10 covers the case in which the process holds in state i
for a time $\sigma \in [0, t)$, moves to another state r, and then first reaches state j in the
remaining time $t - \sigma$. The second term covers the case in which the
process makes a transition directly from state i to state j at time t. Therefore, the time to
move from state i to state j can be estimated as the mean first passage time for a process
from state i to j, and it is evaluated as

$$\bar{\Theta}_{ij} = \int_0^\infty t\, f_{ij}(t)\, dt . \qquad (3.1.11)$$


Since $f_{ij}(t)$ is a recursive function, as shown in Equation 3.1.10, the computation
time increases exponentially as t increases. However, in statistics the mean of a random
variable can be obtained from the first moment of its moment generating function, i.e., from
the first derivative of its exponential transformation at point zero (with a change of sign).
The exponential transformation of a function $g(t)$, denoted as $g^e(s)$, is defined as

$$g^e(s) = \int_0^\infty g(t)\, e^{-st}\, dt . \qquad (3.1.12)$$

So, the exponential transformation of $f_{ij}(t)$ is

$$f^e_{ij}(s) = \sum_{r=1,\, r \ne j}^{n} p_{ir} \int_0^\infty\!\!\int_0^t h_{ir}(\sigma)\, f_{rj}(t-\sigma)\, e^{-st}\, d\sigma\, dt + p_{ij} \int_0^\infty h_{ij}(t)\, e^{-st}\, dt$$

$$\phantom{f^e_{ij}(s)} = \sum_{r=1,\, r \ne j}^{n} p_{ir}\, h^e_{ir}(s)\, f^e_{rj}(s) + p_{ij}\, h^e_{ij}(s) . \qquad (3.1.13)$$


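Therefore, the mean first passage time for a process from state i to j is

$$\bar{\Theta}_{ij} = -\frac{d}{ds} f^e_{ij}(s)\Big|_{s=0} = -\sum_{r=1,\, r \ne j}^{n} p_{ir} \left[ \frac{d}{ds} h^e_{ir}(s)\, f^e_{rj}(s) + h^e_{ir}(s)\, \frac{d}{ds} f^e_{rj}(s) \right]_{s=0} - p_{ij}\, \frac{d}{ds} h^e_{ij}(s)\Big|_{s=0} .$$

Since $-\frac{d}{ds} h^e_{ij}(s)\big|_{s=0}$ is the first moment of the holding time distribution, i.e., the mean holding
time $\bar{\tau}_{ij}$, and

$$\int_0^\infty h_{ij}(t)\, dt = \int_0^\infty f_{ij}(t)\, dt = 1 ,$$

the mean first passage time $\bar{\Theta}_{ij}$ can be derived as below.

$$\bar{\Theta}_{ij} = \sum_{r=1}^{n} p_{ir}\, \bar{\tau}_{ir} + \sum_{r=1}^{n} p_{ir}\, \bar{\Theta}_{rj} - p_{ij}\, \bar{\Theta}_{jj} \qquad (3.1.14)$$

By Equation 3.1.2, Equation 3.1.14 can be written as:

$$\bar{\Theta}_{ij} = \bar{\tau}_i + \sum_{r=1}^{n} p_{ir}\, \bar{\Theta}_{rj} - p_{ij}\, \bar{\Theta}_{jj} . \qquad (3.1.15)$$

However, the mean recurrence time $\bar{\Theta}_{jj}$ is actually the reciprocal of the steady state
entrance rate into state j, i.e.,

$$\bar{\Theta}_{jj} = \frac{1}{e_j} . \qquad (3.1.16)$$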

Thus Equation 3.1.15 can be written as

$$\bar{\Theta}_{ij} = \bar{\tau}_i - \frac{p_{ij}}{e_j} + \sum_{r=1}^{n} p_{ir}\, \bar{\Theta}_{rj} \qquad (3.1.17)$$

and its matrix form is

$$\bar{\Theta} = K + P\, \bar{\Theta} , \qquad (3.1.18)$$

where $\bar{\Theta} = [\bar{\Theta}_{ij}]$ and $K = (TU) - P E^{-1}$, where T, U and E are:

(1)  T is a diagonal matrix in which $T_{ii} = \bar{\tau}_i$, and $T_{ij} = 0$ otherwise,

(2)  U is a unit matrix, i.e., all elements are one, and

(3)  E is a diagonal matrix in which $E_{ii} = e_i$, and $E_{ij} = 0$ if $i \ne j$.

Therefore, the mean first passage time matrix $\bar{\Theta}$ can be derived as:

$$\bar{\Theta} = [I - P]^{-1} K . \qquad (3.1.19)$$
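
The mean first passage times can also be computed numerically. The sketch below is a minimal illustration, not the procedure used in this study: for each destination j it solves Equation 3.1.15 for all states other than j (the term involving the recurrence time of j cancels the r = j term of the sum) and then evaluates the recurrence time of j from the same equation. Solving one destination at a time sidesteps the fact that I - P is singular for a stochastic matrix P. The two-state example and all function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                for k in range(col, n + 1):
                    m[row][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def mean_first_passage(p, tau):
    """Return theta[i][j], the mean first passage time from state i to j.

    p   -- transition probability matrix of the embedded chain
    tau -- tau[i][r], mean holding time in i given that the next state is r
    """
    n = len(p)
    # Mean waiting time of each state (Equation 3.1.2).
    wait = [sum(p[i][r] * tau[i][r] for r in range(n)) for i in range(n)]
    theta = [[0.0] * n for _ in range(n)]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        # For i != j: theta_ij - sum over r != j of p_ir theta_rj = wait_i.
        a = [[(1.0 if i == r else 0.0) - p[i][r] for r in others] for i in others]
        x = solve_linear(a, [wait[i] for i in others])
        for i, value in zip(others, x):
            theta[i][j] = value
        # Mean recurrence time of j: the same equation with i = j.
        theta[j][j] = wait[j] + sum(p[j][r] * theta[r][j] for r in others)
    return theta


# Hypothetical two-state process that alternates between its states,
# holding 2 time units in state 0 and 3 time units in state 1.
P = [[0.0, 1.0], [1.0, 0.0]]
TAU = [[0.0, 2.0], [3.0, 0.0]]
theta = mean_first_passage(P, TAU)
print(theta)
```

For this alternating process the passage times are simply the holding times, and each recurrence time is the length of one full cycle.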


3.2.  Model  Behavior 

In this section we use the measures that were defined previously to predict the key
system characteristics for the stochastic transition probability matrix and holding time
density function matrix estimated from the collected data. By solving the
semi-Markov model we discover that the system makes a transition every 9 minutes and 8
seconds, on average. Comparing this with the mean time between errors (MTBE) listed in
Table 3.1, it is clear that most often the transitions are from one workload state to another.
Also note that the model indicates an MTBE of 4152 hours for CPU errors. This number is
estimated by solving the model equations although there were no observations in the
measured period. (In examining the error data over a one year period we found two CPU
errors.) The table also shows that a disk error occurs (as indicated in the model) almost
every 52 minutes while a software error is detected every 1 hour and 45 minutes. Most of
the disk errors (95%) are recovered through hardware recovery (i.e., hardware instruction
retry or ECC correction), thus resulting in negligible overhead. This shows that on-line
recovery is highly effective and provides a system with the ability to tolerate a fault and
recover almost instantaneously. Thus, a highly reliable system is achieved.

[Table 3.1. Mean time between errors. Rows: CPU, CHAN, SWE, DASD, MULT. Columns: frequency count; mean time between errors (hour). Of the entries, only the value 11.12 is legible.]

Table 3.2 lists the mean recurrence time for recovery routines. It shows that the on-line
hardware recovery routine is invoked once every 0.62 hours, while a software recovery
occurs every 2.57 hours. As mentioned earlier, hardware recovery involves hardware
instruction retry or ECC correction. The maximum number of retries is predetermined. In
the measured system each CPU has a 26-nanosecond machine cycle time and the disk seek
time is about 25 milliseconds. We estimate a worst case hardware recovery cost of 0.5
seconds, i.e., twenty I/O retries: ten through the original I/O path and another
ten through an alternative I/O path if the alternative is available. This, of course,
overestimates the cost of hardware retry used for the CPU errors. However, the impact is
insignificant. This can be seen by comparing the estimated time for each hardware recovery
(0.5 seconds) with the interval between hardware recoveries (0.62 hours): the cost of hardware
recovery amounts to only 0.02% of total computation time. The mean recurrence time of the
alternative recovery routine is not estimated due to lack of data, i.e., this event seldom occurred.

Table 3.2. Mean recurrence time

Type of recovery   Mean recurrence time (hour)
Hardware           0.62
Software           2.57
Alternative        -
Off-line           651.37
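
The 0.02% figure can be reproduced from the two numbers quoted in the text. The following Python lines are a small worked check; the constant names are chosen here for illustration.

```python
RECOVERY_COST_S = 0.5      # worst-case cost of one hardware recovery (seconds)
RECURRENCE_HOURS = 0.62    # mean recurrence time of hardware recovery (Table 3.2)

# Fraction of elapsed time spent in hardware recovery, as a percentage.
recurrence_s = RECURRENCE_HOURS * 3600.0
overhead_percent = 100.0 * RECOVERY_COST_S / recurrence_s
print(f"hardware recovery overhead: {overhead_percent:.3f}% of computation time")
```

Because the 0.5 second cost is a worst-case estimate, the value obtained this way is an upper bound on the overhead rather than a measurement.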

The model characteristics are summarized in Table 3.3. A dashed line in this table
indicates a negligible value (less than 0.00001 probability). Table 3.3(a) shows the normal
system behavior. Given that a transition has occurred, the system CPU load is most likely to
reach $W_7$ or $W_8$, i.e., 0.72 or above. This is also reflected in the entry and occupancy
probabilities (e and $\Phi$). From the occupancy probabilities we see that almost 34% of the time
the CPU load is as high as 0.96 ($W_8$); 39% of the time the CPU is moderately loaded
($W_6 + W_7$).

Table 3.3(b) shows the erroneous system behavior. The table indicates that about 30%
of the transitions are to an error state (obtained by summing the $\pi$'s for all the error
states). The DASD errors have the highest transition and entry probabilities. Since a
transition occurs every 9 minutes, we estimate that an error is detected, on the average, every 30
minutes. Of course, over 98% of these errors caused negligible overhead.
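
The 30-minute estimate follows from dividing the mean time between transitions by the fraction of transitions that enter an error state. A short Python check, using the error-state transition probabilities of Table 3.3(b) and the 9 minute 8 second transition interval quoted in Section 3.2 (variable names are illustrative), is:

```python
TRANSITION_INTERVAL_S = 9 * 60 + 8   # one transition every 9 min 8 s on average

# Transition probabilities of the error states, from Table 3.3(b).
PI_ERROR = {"CPU": 0.00004, "CHAN": 0.0055, "SWE": 0.0850,
            "DASD": 0.1692, "MULT": 0.0322}

# Fraction of all transitions that lead into an error state.
p_error = sum(PI_ERROR.values())
minutes_between_errors = TRANSITION_INTERVAL_S / p_error / 60.0
print(f"P(error transition) = {p_error:.4f}")
print(f"mean time between detected errors = {minutes_between_errors:.1f} minutes")
```

The computed value is slightly above 30 minutes, in line with the rounded figures used in the text.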


Table 3.3. Summary of model characteristics

(a). CPU bound workload states

                                        Workload state
Measure    W0       W1       W2       W3       W4       W5       W6       W7       W8
$\Phi$     0        0.0625   0.0008   0.0136   0.1258   0.0054   0.1639   0.2255   0.3398
$\pi$      0.0257   0.0264   0.0014   0.0104   0.0559   0.0047   0.0635   0.1125
e          0.00005  0.00005  -        0.00002  0.0001   0.00001  0.00012  0.00021  0.00021
$\Theta$   5.78     5.62     102.56   14.32    2.?5     31.38    2.53     1.32     1.32

(b). Error and recovery states

                          Error state                        Recovery state
Measure    CPU      CHAN     SWE      DASD     MULT     HWR      SWR      ALT      OFFL
$\Phi$     0        0.00005  0.0066   0.0383   0.0179   0.00022  0.00011  -        -
$\pi$      0.00004  0.0055   0.0850   0.1692   0.0322   0.2379   0.0572
e          -        0.00001  0.00016  0.00032  0.00006  0.00045  0.00011  -
$\Theta$   4152     26.88    1.75     0.87     4.62     0.62     2.57     4089.5   651.57

An interesting characteristic of the multiple error events is also seen in Table 3.3(b).
Although the transition probability ($\pi$) of a MULT error is lower than that of a SWE error
(0.0322 vs. 0.0850), its occupancy probability ($\Phi$) is higher (0.0179 vs. 0.0066). This is due
to the fact that a MULT error has a longer sojourn time than a SWE error event
(293 seconds vs. 41 seconds, from Table 2.4).
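
This observation can be checked numerically, since the occupancy probability of a state is proportional to its transition probability multiplied by its mean waiting time (Equation 3.1.7). The Python sketch below compares the ratio predicted this way with the ratio of the tabulated occupancy probabilities; the helper name `relative_occupancy` is hypothetical.

```python
# (transition probability, mean waiting time in seconds),
# taken from Table 3.3(b) and Table 2.4 respectively.
SWE = (0.0850, 41.35)
MULT = (0.0322, 293.28)


def relative_occupancy(state):
    """Transition probability times mean waiting time.

    This product is proportional to the occupancy probability; the common
    normalizing constant cancels when two states are compared.
    """
    pi_j, waiting_j = state
    return pi_j * waiting_j


predicted_ratio = relative_occupancy(MULT) / relative_occupancy(SWE)
tabulated_ratio = 0.0179 / 0.0066   # occupancy of MULT over SWE, Table 3.3(b)
print(f"predicted {predicted_ratio:.2f}, tabulated {tabulated_ratio:.2f}")
```

Both ratios are close to 2.7, so the longer sojourn time accounts for the higher occupancy of the MULT state.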


3.3.  Effect of Workload

In this section we compute the steady state probability of being in a specified workload
state and making a transition to a specified error state. Table 3.4 shows the
probabilities of an error occurring at various load levels. In this table, "time" refers to the
mean holding time in the specified workload state (e.g., CPU = 0.96) before the process
makes a transition to the selected state (e.g., CHAN). An important relation between error
probability and holding time in a workload state is seen in this table. The error
probabilities appear to be not only a function of resource usage [10], but also related to the length of
the holding time in a resource usage state. For example, in Table 3.4(a) the probability of a
channel error is almost the same for two different CPU loads, 0.96 and 0.54. The mean
holding time before a channel error occurs at the lower load is larger than that for the
higher load, i.e., 1304.96 seconds versus 668.18 seconds. When the holding times are
similar, however (or increasing with increased usage), the error probabilities do increase with
increasing resource usage. A similar phenomenon also exists for the I/O workload (see
Table 3.4(b)). Thus, not only does a higher workload result in a higher error probability
(for similar holding times), but the error probability also increases with increased holding
time in a particular state. In other words, the error probability appears to be a function of
the absolute amount of resource consumed in a given state, be it through increased
workload and/or increased holding times. An explanation for this apparent "wear out"
phenomenon is not clear (since a large majority of our errors are transient), but it certainly
calls into further question the validity of the constant error probability
assumption often made in reliability modeling.

Table 3.4. Holding time and transition probabilities to error states

(a). CPU workload

                                     Error state
         CHAN               SWE                DASD               MULT               Total
Load     Time      Prob     Time      Prob     Time      Prob     Time      Prob     Prob
0.96     668.18    0.0011   1609.71   0.0786   1218.62   0.1296   1641.20   0.0285
0.72     596.28    0.0032   1118.12   0.0492   971.62    0.0990   757.09    0.0146
0.54     1304.96   0.0010   1507.92   0.0471   1070.10   0.0489   722.26    0.0052

Time - in seconds.

(b). I/O workload

                                     Error state
         CHAN               SWE                DASD               MULT               Total
Load     Time      Prob     Time      Prob     Time      Prob     Time      Prob     Prob
96.20              0        256.23    0.0022   434.54    0.0162   578.97    0.0046   0.023
67.34    898.35    0.0049   243.35    0.0987   1170.84   0.1840   928.12    0.0262   0.3138
41.59    4522.67   0.0214   516.92    0.0875   1148.18   0.1198   1286.67   0.0234   0.2521
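
The "Total Prob" column of Table 3.4 can be read as the probability that the next transition from a given load level is to any error state, i.e., the sum of the four per-error probabilities. The Python sketch below checks this reading for the I/O workload. It is an illustration under an assumption: the CHAN and SWE probabilities for the 67.34 and 41.59 load levels are taken from the semi-Markov rows of Table 3.6(b).

```python
# Transition probabilities from each I/O load level to the error states
# (CHAN, SWE, DASD, MULT), paired with the tabulated "Total Prob" entry.
IO_ROWS = {
    96.20: ([0.0, 0.0022, 0.0162, 0.0046], 0.023),
    67.34: ([0.0049, 0.0987, 0.1840, 0.0262], 0.3138),
    41.59: ([0.0214, 0.0875, 0.1198, 0.0234], 0.2521),
}


def total_error_probability(probs):
    """Probability that the next transition is to any of the error states."""
    return sum(probs)


computed = {load: total_error_probability(probs)
            for load, (probs, _) in IO_ROWS.items()}
for load, (_, tabulated) in IO_ROWS.items():
    print(f"I/O load {load}: computed {computed[load]:.4f}, tabulated {tabulated:.4f}")
```

The computed sums agree with the tabulated totals to four decimal places, which supports this assignment of probabilities to columns.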


3.4.  Model  Validation 

In Chapter 2 we showed that the resource-usage/error/recovery process of the
measured system is best modeled as a semi-Markov process. This is due to the fact that the
waiting and holding time distributions of some states are not exponential. In order to
validate this semi-Markov assumption we compare the occupancy probabilities of states
predicted from the model with the values estimated from the collected data.

From Equation 3.1.7 we know that the state occupancy probability of the model, $\Phi_i$,
is defined as:

$$\Phi_i = \frac{\pi_i\, \bar{\tau}_i}{\bar{\tau}} .$$

However, the actual occupancy probability, denoted as $\hat{\Phi}_i$, can be estimated from the
collected data by using the following equation.

$$\hat{\Phi}_i = \frac{\text{total time that the system was observed to be in state } i}{\text{length of the observation period}} \qquad (3.4.1)$$

Table 3.5 lists the comparison of these two measures, $\Phi$ and $\hat{\Phi}$, for the normal states with
significant occupancy probability (greater than 0.1 probability) and for one key error state
(DASD). From Table 3.5 we see that the predicted probabilities closely match those
estimated from the collected data (with a maximum of 0.025 tolerance, i.e., $\epsilon / \hat{\Phi}$). This
indicates that the semi-Markov model is a good model for the
resource-usage/error/recovery process of the measured system.

Table 3.5. Comparison of occupancy probabilities for different states

State                     W4       W6       W7       W8       DASD
$\Phi$                    0.1258   0.1639   0.2255   0.3398   0.0383
$\hat{\Phi}$              0.1259   0.1634   0.2311   0.3452   0.0386
$\epsilon$                0.0001   0.0005   0.0056   0.0054   0.0003
$\epsilon / \hat{\Phi}$   0.0008   0.0031   0.0242   0.0156   0.0078

$\Phi$ : predicted occupancy probability
$\hat{\Phi}$ : actual occupancy probability
$\epsilon$ : the absolute error, $|\Phi - \hat{\Phi}|$
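
The error rows of Table 3.5 can be recomputed from the predicted and actual occupancy probabilities. The Python sketch below does this for the four states whose entries are unambiguous in the table (predicted values 0.1639, 0.2255, 0.3398 and 0.0383); the relative error is taken with respect to the actual value, which is an interpretation of the tolerance quoted in the text.

```python
# (predicted occupancy, actual occupancy) pairs from Table 3.5.
PAIRS = [(0.1639, 0.1634), (0.2255, 0.2311), (0.3398, 0.3452), (0.0383, 0.0386)]

abs_errors = [abs(pred - actual) for pred, actual in PAIRS]
rel_errors = [err / actual for err, (_, actual) in zip(abs_errors, PAIRS)]
max_rel_error = max(rel_errors)

for (pred, actual), err, rel in zip(PAIRS, abs_errors, rel_errors):
    print(f"predicted {pred:.4f}  actual {actual:.4f}  abs {err:.4f}  rel {rel:.4f}")
```

The largest relative error is about 0.024, which is consistent with the stated 0.025 tolerance.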

3.5.  Markov  Versus  Semi-Markov 

In this section we investigate the significance of using a semi-Markov model to
describe the overall resource-usage/error/recovery process. It has been argued that since
errors only occur infrequently (i.e., $\lambda$ is small), a Markov model may well approximate the
real behavior. Thus, although the collected data show that the semi-Markov process is a
better model for the resource-usage/error/recovery process, i.e., more closely approximates
the data from the measured system, it is reasonable to ask what deviations may occur if a
Markov process is assumed. In order to answer this question we use a Markov model to
describe the resource-usage/error/recovery process of the measured system and compare the
results with those obtained through the more realistic semi-Markov model.

Two measures, the unconditional transition probability to the next state ($\gamma$) and the
first passage time ($\Theta$), are used as the basis for comparison.


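3.5.1.  Unconditional transition probability ($\gamma$)


Given a stochastic transition probability matrix $[p_{ij}]$ and a holding time density
function matrix $[h_{ij}(t)]$, the unconditional transition probability from state i to state j,
denoted as $\gamma_{ij}$, in the semi-Markov process is given by [25]:

$$\gamma_{ij} = \frac{\pi_i\, p_{ij}\, \bar{\tau}_{ij}}{\sum_{k=1}^{n} \sum_{l=1}^{n} \pi_k\, p_{kl}\, \bar{\tau}_{kl}} = \frac{\pi_i\, p_{ij}\, \bar{\tau}_{ij}}{\bar{\tau}} , \qquad (3.5.1)$$

where $\bar{\tau}_{ij}$ is the mean holding time before a transition occurs from state i to state j and $\bar{\tau}$
is the mean holding time of the process. Because the selection of the next state in the
Markov process is not dependent on the holding time in the current state, i.e., $\bar{\tau}_{ij} = \bar{\tau}_{ik}$ for
every j and k, from Equation 3.1.2 we have

$$\bar{\tau}_i = \sum_{j} p_{ij}\, \bar{\tau}_{ij} = \bar{\tau}_{ij} \sum_{j} p_{ij} = \bar{\tau}_{ij} . \qquad (3.5.2)$$

Substituting this into Equation 3.5.1 we have

$$\gamma_{ij} = \frac{\pi_i\, p_{ij}\, \bar{\tau}_i}{\bar{\tau}} . \qquad (3.5.3)$$

Further, from Equation 3.1.7 we have

$$\gamma_{ij} = \Phi_i\, p_{ij} . \qquad (3.5.4)$$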
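
Comparing the semi-Markov and Markov expressions for the unconditional transition probability shows that, for the same state pair, the Markov value exceeds the semi-Markov value by the ratio of the mean waiting time of the current state to the mean holding time toward the destination. The Python sketch below evaluates this ratio for the CPU load of 0.96, using the holding times of Table 3.4(a). It rests on a labeled assumption: the mean waiting time of that load state is taken to be 1612.72 seconds, the last row of Table 2.4(a).

```python
WAITING_TIME_S = 1612.72   # assumed mean waiting time at CPU load 0.96

# Mean holding times (seconds) before each error type, Table 3.4(a), load 0.96.
HOLDING_TIME_S = {"CHAN": 668.18, "SWE": 1609.71, "DASD": 1218.62, "MULT": 1641.20}

# Markov / semi-Markov ratio: waiting time over destination holding time.
ratio = {err: WAITING_TIME_S / t for err, t in HOLDING_TIME_S.items()}

# Ratios of the tabulated probabilities in Table 3.6(a) for the same load.
tabulated = {"CHAN": 0.0025 / 0.0011, "SWE": 0.0783 / 0.0786,
             "DASD": 0.1705 / 0.1296, "MULT": 0.0278 / 0.0285}

for err in HOLDING_TIME_S:
    print(f"{err}: from holding times {ratio[err]:.3f}, from Table 3.6 {tabulated[err]:.3f}")
```

A ratio above 1 means the Markov model overestimates the transition probability to that error state; a ratio below 1 means it underestimates it. The small differences between the two columns are within what rounding of the tabulated probabilities would produce.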


Table 3.6 compares the unconditional transition probability for semi-Markov and
Markov models. We see from Table 3.6(a) that when the CPU load is as high as 0.96, the
transition probabilities to the software and multiple errors are close for both models. This
is also true for channel error when the CPU load is 0.54. This is because for some
destinations j the holding time to the next state is the same as the waiting time of the current
state, i.e., $\bar{\tau}_i = \bar{\tau}_{ij}$. For the majority of the cases, however, the Markov and semi-Markov
models are not in agreement. Table 3.7 shows the ratios of the unconditional transition
probability $\gamma_{ij}$ estimated from both models, Markov versus semi-Markov. If the ratio is
less than 1 then the Markov process underestimates the transition probability; if it is greater
than 1, it overestimates. From this table we see that the Markov assumption sometimes
overestimates and sometimes underestimates the transition probability. In particular it
overestimates the transition probabilities to most error states, regardless of the state.
Overestimation will lead to an unduly conservative reliability estimate and underestimation to an
overly optimistic estimate. Thus both are undesirable.

Table 3.6. Comparison of transition probabilities, $\gamma$
(Markov versus Semi-Markov)

(a). From CPU workload to error states

                                 Error
Load     Model          CHAN     SWE      DASD     MULT
0.96     semi-Markov    0.0011   0.0786   0.1296   0.0285
         Markov         0.0025   0.0783   0.1705   0.0278
0.72     semi-Markov    0.0032   0.0492   0.0990   0.0146
         Markov         0.0058   0.0469   0.1086   0.0206
0.54     semi-Markov    0.0010   0.0471   0.0489   0.0052
         Markov         0.0011   0.0429   0.0627   0.0099

(b). From I/O workload to error states

                                 Error
Load     Model          CHAN     SWE      DASD     MULT
96.20    semi-Markov    0        0.0022   0.0162   0.0046
         Markov         0        0.0082   0.0350   0.0075
67.34    semi-Markov    0.0049   0.0987   0.1840   0.0263
         Markov         0.0068   0.0987   0.1955   0.0352
41.59    semi-Markov    0.0214   0.0875   0.1198   0.0234
         Markov

3.5.2.  First Passage Time ($\Theta$)

We now examine the difference between the first passage times under the Markov and
the semi-Markov assumptions. The first passage time distribution can be used to estimate
the MTBE and its variance.

The mean first passage time from state i to state j, $\bar{\Theta}_{ij}$, in a semi-Markov process is
given in Section 3.1 as:

$$\bar{\Theta}_{ij} = \bar{\tau}_i + \sum_{r=1}^{n} p_{ir}\, \bar{\Theta}_{rj} - p_{ij}\, \bar{\Theta}_{jj} . \qquad (3.5.5)$$

From this equation, we notice that the mean first passage time depends only on the mean
holding time and the conditional transition probability of the current state. Clearly, if the
first moment of the first passage time to the error state (e.g., the MTBE) is the only main
concern, the Markov process should be able to provide adequate information. If the
distribution (or the higher moments) of the first passage time is of interest, the Markov model
may be inadequate, particularly if the variance of the first passage time is large. This can
be seen clearly from the following equation [25].

$$\overline{\Theta^2_{ij}} = \begin{cases} \dfrac{1}{\pi_j} \left( \displaystyle\sum_{k=1}^{n} \pi_k\, \overline{\tau^2_k} + 2 \sum_{k=1}^{n} \sum_{r=1,\, r \ne j}^{n} \pi_k\, p_{kr}\, \bar{\tau}_{kr}\, \bar{\Theta}_{rj} \right) & \text{if } i = j \\[2ex] \overline{\tau^2_i} + \displaystyle\sum_{r=1,\, r \ne j}^{n} p_{ir} \left( 2\, \bar{\tau}_{ir}\, \bar{\Theta}_{rj} + \overline{\Theta^2_{rj}} \right) & \text{otherwise} \end{cases} \qquad (3.5.6)$$

This equation indicates that the second moment of the first passage time is a function of the
second moment of the state waiting time, as well as the mean holding time to the next state.
Since the mean ($E[X]$) and the standard deviation ($\sigma$) of an exponential distribution are the
same, the second moment of an exponential distribution is only a function of its mean,
i.e., $E[X^2] = 2E[X]^2$. Thus, a Markov assumption may under- or over-estimate the second
moment, $E[X^2]$, if $\sigma \ne E[X]$.
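
To see how large this effect can be for a single state, the Python sketch below compares the second moment implied by an exponential waiting time with the second moment computed from a measured mean and standard deviation. The example values (mean 1612.72 seconds, standard deviation 2576.35 seconds) are one row of Table 2.4(a). This illustrates only the waiting-time ingredient of the first passage time calculation, not the full second moment of the first passage time; the function names are hypothetical.

```python
def second_moment(mean, std):
    """Second moment of any distribution: variance plus squared mean."""
    return mean ** 2 + std ** 2


def exponential_second_moment(mean):
    """Second moment of an exponential distribution with the given mean."""
    return 2.0 * mean ** 2


# A waiting-time row from Table 2.4(a), in seconds.
mean_s, std_s = 1612.72, 2576.35

# Markov (exponential) value relative to the measured value.
ratio = exponential_second_moment(mean_s) / second_moment(mean_s, std_s)
print(f"exponential / measured second moment = {ratio:.3f}")
```

Here the standard deviation is well above the mean, so the exponential assumption understates the second moment of the waiting time by more than 40%. When the standard deviation equals the mean the two functions agree exactly.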

Table 3.8 shows the ratio of $\overline{\Theta^2}$ (Markov/semi-Markov) for transitions from a few
selected workload states to the error states. From Table 3.8, we see that the Markov
assumption frequently underestimates the second moment of the first passage time (to the
error state). The underestimation can be as much as 30%. However, it overestimates the
variation of first passage time among different resource usage states, although this is not
shown in the table.

Table 3.8. Ratio of $\overline{\Theta^2}$ (Markov/semi-Markov)

Workload                        Error
Resource   Load     CHAN     SWE      DASD     MULT
CPU        0.96     0.989    0.876    0.694    0.939
           0.72     0.996    0.998    0.909    0.972
           0.54     0.992    0.937    0.830    0.953
DASD       96.20    1.006    1.010    0.901    0.991
           67.34    1.002    0.963    0.888    0.972
           41.59    1.001    0.948    0.871    0.963


3.5.3.  Summary 

In  summary,  our  measurements  show  that  using  a  Markov  model  frequently  overesti¬ 
mates  the  unconditional  transition  probabilities  and  underestimates  the  variance  of  the  first 
passage  times  to  the  error  states.  The  overestimation,  of  course,  will  lead  to  an  unduly  con¬ 
servative  reliability  prediction.  It  can  be  argued  that  such  gross  overestimation  (as  seen  in 
some  cases  here)  is  undesirable  and  may  not  be  cost  beneficial.  The  underestimation  is  no 
doubt  a  serious  problem  which  may  lead  to  unduly  optimistic  reliability  prediction. 


CHAPTER  4 


SOFTWARE  RELIABILITY  MODEL 


4.1.  Introduction 

The  problem  of  modeling  software  reliability  during  the  development,  debugging  and 
validation  phases  of  the  software  cycle  is  a  well  researched  area.  However,  there  are  few 
studies  which  model  software  error  and  recovery  processes  in  a  fully  operational  produc¬ 
tion  environment.  The  difficulties  are  partly  due  to  the  fact  that,  unlike  computer 
hardware,  which  is  reasonably  modularized,  each  software  system  can  have  its  own  pecu¬ 
liar  characteristics.  At  this  stage,  it  is  extremely  valuable  to  develop  a  comprehensive  model 
quantifying  the  software  error  and  recovery  processes  in  a  production  system  using  real 
data.  In  addition  to  providing  useful  information  on  how  and  when  errors  occur  in  the  real 
world,  this  process  provides  the  quantification  of  the  interaction  among  different  types  of 
software  errors:  an  important  result  for  developing  analytical  models. 

In this chapter, a state-transition model of the software error and recovery
processes in a complex operating system is described. Measurements were made on an MVS
(Multiple Virtual Storage) system running on an IBM 3081 mainframe. Time-stamped
low-level error and recovery data from MVS, collected during the normal operation of the
system, formed the basis for developing the model. The semi-Markov model developed from
the real data provides a quantification of the system error characteristics and also gives an
insight into the interaction between the various software error and recovery processes
occurring during normal system operation.

4.1.1.  Related  Research 

Most software reliability models refer to the development, debugging and
testing phases of the software [11, 12, 13, 14]. Few of these models have been applied to the
operational phase of the software. In [2] and [15], software failures in an operating
environment are studied. Both studies found that at least 60% of system failures were
software related. There has been little explicit study of hardware/software reliability. In
[15], software failures related to hardware problems in the operating system are analyzed
and it is shown that errors in the hardware/software interface are often fatal. In [27], a
resource-usage/reliability model was developed from real data and it was seen that about
36% of detected errors (not necessarily system failures) were related to software problems.

With the exception of software reliability growth models, which have been validated
with real data, there are few, if any, models of software reliability in an operational
environment. Exceptions are the hardware and software model discussed in [18] and a
measurement-based model of workload dependent failures discussed in [10]. However,
these only describe the external behavior of the system and do not provide insight into
component-level behavior.

It is therefore highly instructive to develop a detailed model based on low-level error
data from a production system. In the following sections, we construct an error/recovery
model for the MVS operating system. Software problems of differing severity are modeled.
Multiple errors are also considered and the effect of on-line recovery routines is taken
into account.


4.2.  Error  Characterization 

In this section the collection and characterization of the software error and error
recovery data are discussed. A state-transition diagram is developed to describe the
different error and recovery states. This allows us to determine the severity of errors and
the effectiveness of recovery.

Error  data  based  on  different  causes  were  collected.  Information  on  software  errors  is 
automatically  logged  by  an  operating  system  module.  Details  of  the  logging  mechanism  are 
described  in  [23].  In  order  to  ensure  that  the  analysis  is  not  biased  by  error  records  relating 
to  the  same  problem,  two  levels  of  data  reduction  which  were  described  in  Chapter  3  were 
performed.  As  a  result,  the  software  errors  were  classified  into  eight  classes.  These  eight 
classes  are  called  error  events,  since  they  may  contain  more  than  one  error,  and  are  defined 
as  follows. 

(1) Control (CTRL) - incidents indicating the invalid use of control statements and
    invalid supervisor calls

(2) Deadlocks (DLCK) - incidents indicating system or operator detected endless loop,
    endless wait state or violation of system or user-defined time limits

(3) I/O and Data Management (I/O) - incidents indicating problems occurred during I/O
    management or during the creation and processing of data sets

(4) Storage Management (SM) - incidents indicating errors in the storage
    allocation/de-allocation process or in virtual memory mapping

(5) Storage Exceptions (SE) - incidents indicating addressing of nonexistent or
    inaccessible memory locations

(6) Programming Exceptions (PE) - incidents indicating program errors other than
    storage exceptions

(7) Others (OTHR) - incidents indicating that problems occurred which do not fit the
    above categories

(8) Multiple Errors (MULT) - incidents indicating more than one type of error listed
    above




Table  4.1  lists  the  frequencies  of  different  types  of  software  error  events  defined 
above.  The  table  shows  that  more  than  one  half  (52.5%)  of  software  errors  were  I/O  and 
data  management  errors  and  another  11.4%  of  the  errors  were  storage  management  errors. 
A  significant  percentage  (17.4%)  of  errors  were  classified  as  multiple  errors  and  are 
specifically  modeled  in  the  following  sub-section. 

Table 4.1. Frequency of software errors

Type of Errors             Frequency    Percent
Control                        213        7.72
Deadlock                        23        0.84
I/O & Data Management         1448       52.50
Program Exception               65        2.43
Storage Exception              149        5.40
Storage Management             313       11.35
Others                          66        2.32
Multiple Error                 481       17.44
Total                         2758      100.00

4.2.1. Multiple Errors

A multiple error most often is due to a single fault that has non-identical
manifestations provoked by different types of system activity. Since the manifestations are
not identical, recovery may be complex. Figure 4.1 shows the state-transition diagram of a
multiple error developed from the data. The transition probability from state $i$ to state
$j$, $p_{ij}$, is estimated from the measured data using:

\[ p_{ij} = \frac{\text{observed number of transitions from state } i \text{ to state } j}{\text{observed number of transitions from state } i}. \]

This figure not only illustrates the possible interactions among different software errors
but also provides detailed information on the occurrence of transitions. For example, if a
program exception error (PE) occurs, there is about a 63% chance that a storage exception
(SE) error will follow. Further, there is more than a 50% chance that one storage error will
be followed by another error of the same type (52% for storage management and also for
storage exception). If we only focus on those transitions with significant probabilities
(i.e., higher than 0.1), the number of states in Figure 4.1 can be reduced to five. The
state-transition diagram for these active states is illustrated in Figure 4.2. Notice that a
cyclic path is formed by the I/O and data management (I/O) state along with the two
different types of exception states (program exception and storage exception).
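
As an illustration, the count-based estimate of the transition probabilities, and the
pruning of transitions at or below 0.1, can be sketched in Python. This is only a sketch:
the function names and the short event sequence are hypothetical and are not the measured
data.

```python
from collections import Counter, defaultdict

def estimate_transition_probabilities(sequence):
    """Estimate p_ij as (# transitions i -> j) / (# transitions out of i)."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    out_counts = Counter(sequence[:-1])  # the last state has no outgoing transition
    probs = defaultdict(dict)
    for (i, j), n_ij in pair_counts.items():
        probs[i][j] = n_ij / out_counts[i]
    return dict(probs)

def prune(probs, threshold=0.1):
    """Keep only the transitions whose probability is higher than the threshold."""
    return {i: {j: p for j, p in row.items() if p > threshold}
            for i, row in probs.items()}

# Hypothetical sequence of error states observed inside one multiple-error event.
events = ["PE", "SE", "SE", "I/O", "PE", "SE", "SM", "SM", "I/O"]
p = estimate_transition_probabilities(events)
reduced = prune(p, threshold=0.1)
```

Each row of `p` sums to one by construction, since every observed transition out of a
state is counted exactly once in both the numerator and the denominator.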

4.2.2.  Recovery  Modeling 

Recovery in MVS is designed as a means by which the system can prevent a total loss.
Whenever a program is abnormally interrupted due to the detection of an error, the
Supervisor gets control. If the problem is such that further processing could degrade the
system or destroy data, the Supervisor gives control to the Recovery Termination Manager
(RTM). If a recovery routine is available for the problem program, RTM gives control to this
routine before deciding to terminate the program.


Figure  4.2.  Reduced  state-transition  diagram  of  multiple  errors 


The purpose of a recovery routine is to free the resources kept by the failing program
(if any), to locate the error, and to request either a continuation of the termination process
or a retry. Recovery routines are generally provided to cover critical MVS functions. It is,
however, the responsibility of the installation (or of the user) to write a recovery routine
for other programs.

More  than  one  recovery  routine  can  be  specified  for  the  same  program:  if  the  latest 
recovery  routine  asks  for  a  termination  of  the  program,  the  RTM  can  give  control  to 
another recovery routine (if provided). This process is called "percolation." The percolation process
ends  if  either  a  routine  issues  a  valid  retry  request  or  no  more  routines  are  available.  In  the 
latter  case,  the  program  and  its  related  subtasks  are  terminated.  If  a  valid  retry  is 
requested,  a  retry  routine  restores  a  valid  status  using  the  information  supplied  by  the 
recovery  routine(s)  and  gives  control  to  the  program.  In  order  for  a  retry  to  be  valid,  the 
system  should  verify  that  there  is  no  risk  of  error-recurrence  and  that  the  retry  address  is 
properly  specified.  An  error  may  have  four  possible  effects. 

1) Retry - The system successfully recovered and returned control to the problem
   program.

2) Task Termination - The program and its related subtasks are terminated, but the
   system is not affected.

3) Job Termination - The job in control at the time of the error is aborted.

4) System Damage - The job or task in control at the time of the error was critical for
   system continuation. Thus, job/task termination resulted in system failure.
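
The percolation process described above can be summarized by a small toy model. The sketch
below is an assumption-laden simplification and not the RTM interface: each recovery
routine is represented by a hypothetical Python callable that returns "retry" or
"percolate", and the distinction between task termination, job termination and system
damage is collapsed into a single "termination" outcome.

```python
def run_recovery(routines):
    """Walk the chain of recovery routines, latest first (toy model).

    Each routine is a callable returning "retry" (a valid retry request) or
    "percolate" (the routine asks for the termination to continue).
    Returns (outcome, number_of_percolations).
    """
    percolations = 0
    for index, routine in enumerate(routines):
        if routine() == "retry":
            return "retry", percolations
        # The routine asked for termination: percolate only if another
        # recovery routine has been provided.
        if index + 1 < len(routines):
            percolations += 1
    return "termination", percolations

def retry():      # hypothetical routine that issues a valid retry request
    return "retry"

def give_up():    # hypothetical routine that asks for termination
    return "percolate"

direct_retry = run_recovery([retry])                 # retry without percolation
retry_after_one = run_recovery([give_up, retry])     # one percolation, then retry
no_percolation = run_recovery([give_up])             # nothing left to percolate to
```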

Figure 4.3 illustrates the steps in the recovery process. It is clear that recovery can be as
simple as a retry or more complex, requiring several percolations before a retry. The
problem can also be such that no retry or percolation is possible. Table 4.2 shows the
percentages for these different types of situations. For example, for storage management
errors, approximately 8% of the cases resulted in a direct retry, 84% involved some
percolation and over 8% could not be percolated any further (i.e., job/task termination).
The table shows that only in a small percentage of the cases was the problem unrecoverable
(no-percolation).

Table 4.2. Percentages of recovery attempts for a software error

Type of error              Retry (%)   Percolation (%)   No-Percolation (%)
Control                      78.38         21.62              0.0
Deadlock                      2.78         97.22              0.0
I/O & Data Management        93.49          6.51              0.0
Program Exception            20.09         79.91              0.0
Storage Exception            28.09         71.91              0.0
Storage Management            7.77         83.73              8.50
Others                       14.89         85.11              0.0


4.3.  Software  Reliability  Model 
4.3.1.  Overall  Error/Recovery  Model 

In this section we combine the separate error and recovery models to construct a single
overall model, shown in Figure 4.4. Note that a state, Normal, represents normal system
operation. The results of the recovery process are classified into three different states
(resume op, task term and job term) to reflect the severity of errors. The model thus
provides a complete overview of the software error and recovery process, from the
occurrence of an error to its recovery.


Figure  4.4.  Software  error/recovery  model 


4.3.2. Waiting Time Distributions


Table 4.3 shows the characteristics of both normal and error states in terms of their
waiting times. Note that the duration of a single error is generally in the range of 20-40
seconds on the average, except for deadlock and "others". The table also shows that the
errors not classified are relatively insignificant since their duration is less than 2 seconds.
Program exceptions take twice as long as control errors (42 seconds versus 21 seconds).
This is possibly due to the extensive software involvement in recovering from program
exceptions.

Table 4.3. Mean waiting time (in seconds) of states

State                    # of obs   Mean waiting time   Standard deviation   Std error of mean
Normal                     2757         10461.33            32735.04              623.44
Control                     213            21.92               84.21                5.77
Deadlock                     23             4.72               22.61                4.72
I/O & Data Management      1448            25.05               77.62                2.04
Program Exception            65            42.23               92.98               11.53
Storage Exception           149            36.82               79.59                6.52
Storage Management          313            33.40               95.01                5.37
Others                       66             1.86               12.98                1.60
Multiple Error              481           175.59              252.79               11.53

Figure 4.5. Time to error density (horizontal axis: duration in minutes)

Figure 4.5 shows the density of waiting time in the normal operation state, i.e., the
density of the time to error. This density could not be fitted to a simple exponential.
Because of its shape, we found that it was fitted by a multi-stage gamma density function
better than by a phase-type exponential at the same acceptable level. The multi-stage
gamma density function $f(t)$ is defined as


\[ f(t) = \sum_{i=1}^{n} a_i \, g(t; \alpha_i, s_i), \tag{4.3.1} \]

where $a_i \ge 0$, $\sum_{i=1}^{n} a_i = 1$, and $n$ is the number of stages. Here
$g(t; \alpha, s)$ is a gamma density

function (with $s$ the shift from the origin),

\[ g(t; \alpha, s) = \begin{cases} 0, & t < s, \\ \dfrac{1}{\Gamma(\alpha)} (t - s)^{\alpha - 1} e^{-(t - s)}, & t \ge s, \end{cases} \tag{4.3.2} \]

where $\Gamma(\alpha)$ is the gamma function. The density in Figure 4.5 so obtained has
five stages, given by


\[ f(x) = 0.748\, g(x; 2.1, -1) + 0.055\, g(x; 0.5, 0) + 0.069\, g(x; 3.5, 3) + 0.030\, g(x; 5.0, 8) + 0.098\, g(x; 5.0, 17). \]

The fit was tested using the Kolmogorov-Smirnov test [26] at the 0.01 significance level.
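
The five-stage density can be evaluated numerically with a short Python sketch. The sketch
assumes that each stage is a unit-scale gamma density with shape $\alpha$ shifted by $s$,
i.e. $g(t; \alpha, s) = (t - s)^{\alpha - 1} e^{-(t - s)} / \Gamma(\alpha)$ for $t > s$ and
zero otherwise; this form, and the function names, are assumptions made for illustration.
The weights, shapes and shifts are the ones quoted in the text.

```python
import math

def shifted_gamma_pdf(t, alpha, s):
    """Unit-scale gamma density with shape alpha, shifted by s (assumed form)."""
    x = t - s
    if x <= 0.0:
        return 0.0
    return x ** (alpha - 1.0) * math.exp(-x) / math.gamma(alpha)

# (weight a_i, shape alpha_i, shift s_i) for the five stages quoted in the text.
STAGES = [
    (0.748, 2.1, -1.0),
    (0.055, 0.5, 0.0),
    (0.069, 3.5, 3.0),
    (0.030, 5.0, 8.0),
    (0.098, 5.0, 17.0),
]

def time_to_error_pdf(x):
    """Multi-stage gamma density: weighted sum of shifted gamma densities."""
    return sum(a * shifted_gamma_pdf(x, alpha, s) for a, alpha, s in STAGES)

total_weight = sum(a for a, _, _ in STAGES)  # the stage weights must sum to one
```

The later stages contribute nothing before their shifts, so each stage adds a separate
bump to the density further along the time axis.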

4.3.3. Recovery Time Distribution

For the purposes of evaluating the time for recovery, we assumed that each recovery
mode takes a constant amount of time. The overall recovery time, i.e., the duration of an
error event (or the waiting time in an error state), however, was not constant, since an
error event can involve more than one recovery attempt or may require more than one
recovery routine. The recovery time was then computed as the time difference between the
first and the last detected error caused by the same event. The duration of an error event
was used to measure the effectiveness of recovery from this event and also the severity of
the error.

Figure 4.6 shows the recovery time densities for three different types of errors: I/O
and data management, storage management, and multiple errors. Note that none of these
densities could be fitted by simple exponentials at an acceptable level of significance. Thus,
they were fitted to phase-type exponential density functions [26]:

(1) I/O:  $f(t) = 0.07825\, e^{-[\text{illegible}]\, t} + 0.000354\, e^{-0.003938 t}$,

(2) SM:   $f(t) = 0.1642\, e^{-[\text{illegible}]\, t} + 0.030424\, e^{-0.03329 t} + 0.0006634\, e^{-0.006586 t}$, and

(3) MULT: $f(t) = 0.078\, e^{-[\text{illegible}]\, t} + 0.002426\, e^{-0.003866 t} + 0.002163\, e^{-0.00352 t}$.

4.3.4. Summary

In summary, the model developed explicitly quantifies the error and recovery process
in the measured software. We note that both the time to error and the recovery time
distributions in several key states cannot be modeled as simple exponentials. Hence the
overall process is modeled as a semi-Markov process. Further, the semi-Markov process is
irreducible, with the resume operation (resume op) state, the job termination (job term)
state, and the task termination (task term) state being recurrent.

In  the  next  sub-section  we  analyze  the  overall  model  to  determine  key  software  error 
characteristics.  The  mean  time  between  different  types  of  errors  is  evaluated  along  with 
model  characteristics  such  as  the  occupancy  probability  of  key  error  states. 


4.4.  Model  Analysis 

4.4.1.  General  Characteristics 

By solving the semi-Markov model, we discover that the measured software system
made a transition, on the average, every 43 minutes and 22 seconds. Table 4.4 lists the
mean time between different software errors (i.e., mean time between errors) and Table 4.5
shows the mean recurrence time for recovery processes. By examining the mean recurrence
time for I/O and MULT errors from Table 4.4 and comparing them with the mean waiting
times in Table 4.3, we find that although the I/O errors occur about 3 times as often as the

4.4.2. Model Probabilities

Given the irreducible semi-Markov model of Figure 4.4, the following steady state
probabilities were evaluated. The derivations of these measures are given in Section 3.1.


(1) transition probability ($\pi_j$) - given that the process is now making a transition,
    the probability that the transition is to state $j$

(2) occupancy probability ($\phi_j$) - at any instant of time, the probability that the
    process occupies state $j$

(3) entry probability ($e_j$) - at any instant, given that the process is entering a
    state, the probability that the process enters state $j$

(4) mean recurrence time ($\theta_j$) - mean recurrence time of state $j$


The model characteristics are summarized in Table 4.6. A dashed line in this table
indicates a negligible value (less than 0.00001 probability).

Table 4.6. Characteristics of software error/recovery model

(a)
           Normal    Error state
Measure    state     CTRL      DLCK     I/O       PE         SE         SM        OTHR     MULT
$\pi$      0.2474    0.0191    0.0020   0.1299    0.0060     0.0134     0.0281    0.0057   0.0431
$\phi$     0.9950    0.00016   -        0.00125   0.000098   0.000189   0.00056   -        0.002915

(b)
             Recovery state                          Result
Measure      Retry    Percolation   No-Percolation   Resume op   Task term   Job term
$\pi$        0.1704   0.0845        0.0030           0.1414      0.0712      0.0348
$\theta^a$   4.25     8.55          241.43           5.11        10.16       20.74

a: in hours

From the occupancy probability, $\phi$, of the normal state in Table 4.6(a), we see that
about 99.5% of the time the software system is operating normally, i.e., only 0.5% of the
time is the system detecting software errors. This indicates that the reliability of the
measured software system can be as high as 0.995. From Section 2.2 we know that about 35%
of observed errors were software errors. Thus the effect on the overall system reliability
due to the software errors is very significant.

The table also shows that, of all possible transitions made, 24.73% are to an error state
(obtained by summing the $\pi$'s for all the error states) and another 25.79% are to a
recovery state. Since it was seen earlier that a transition occurs every 43 minutes, we
estimate that a software error is detected, on the average, every 3 hours. From Table
4.6(b), we notice that although an error is detected almost every 3 hours, a successful
recovery (i.e., one that results in resume operation) occurs only once every five hours,
i.e., nearly 43% of the errors result in task/job termination.
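
The arithmetic behind these estimates can be reproduced from the $\pi$ values of Table 4.6
and the mean time between transitions (43 minutes and 22 seconds). The Python sketch below
assumes that the mean recurrence time of a state, or of a group of states, is the mean time
between transitions divided by the corresponding transition probability; the variable and
function names are illustrative only.

```python
MEAN_TRANSITION_HOURS = (43 * 60 + 22) / 3600.0  # 43 min 22 s between transitions

# Transition probabilities (pi) taken from Table 4.6.
PI_ERROR = {"CTRL": 0.0191, "DLCK": 0.0020, "I/O": 0.1299, "PE": 0.0060,
            "SE": 0.0134, "SM": 0.0281, "OTHR": 0.0057, "MULT": 0.0431}
PI_RESULT = {"resume op": 0.1414, "task term": 0.0712, "job term": 0.0348}

def recurrence_hours(pi):
    """Assumed relation: recurrence time = mean transition time / pi."""
    return MEAN_TRANSITION_HOURS / pi

pi_any_error = sum(PI_ERROR.values())                    # fraction of transitions to error states
hours_between_errors = recurrence_hours(pi_any_error)    # roughly 3 hours
hours_between_resumes = recurrence_hours(PI_RESULT["resume op"])  # roughly 5 hours
termination_fraction = (PI_RESULT["task term"] + PI_RESULT["job term"]) / sum(PI_RESULT.values())
```

Under this assumption the recurrence times of Table 4.6(b) are recovered as well; for
example, `recurrence_hours(0.1704)` gives the retry entry to within rounding.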

Multiple-error events formed a significant category on their own. Since this type of
event involves several errors and results in considerable overhead, it is analyzed separately
in the next section.

4.4.3. Characteristics of a Multiple Error

In Section 4.2 we pointed out that about 17% of software errors were multiple errors.
We also noticed that the multiple errors mostly consist of I/O, storage, or program errors.
A strong connection between program and storage exceptions was seen in the occurrence of
a multiple error. Table 4.7 lists the characteristics of a multiple error; it was obtained by
solving the semi-Markov model described in Figure 4.1 with a zero holding time in the
normal state (i.e., given that a multiple error occurs).

Table 4.7. Characteristics of a multiple error

             Normal     Error state
Measure      state      CTRL      DLCK      I/O       PE        SE        SM        OTHR
$\pi$        0.1767     0.0327    0.0048    0.1451    0.1473    0.2957    0.1360    0.0617
$\phi$       0          0.0648    0.0130    0.1004    0.0837    0.2202    0.2717    0.0462
$e$          0.00568    0.00105   0.00015   0.00466   0.00473   0.00950   0.00437   0.00198
$\theta^a$   0.0489     0.2647    1.8126    0.0596    0.0587    0.0292    0.0636    0.1401

a: in hours

From Table 4.7 we see (from $\pi$, the transition probability) that nearly 30% of the
transitions are made to the storage exception state when the process enters a multiple
error mode. Once in a multiple error mode, a storage exception error occurs every 1 minute
and 45 seconds ($\theta$ = 0.0292 hours in Table 4.7), while the average duration of a
multiple error is about 2 minutes and 56 seconds ($\theta$ = 0.0489 hours, the recurrence
time of the normal state). Note that the average duration of a multiple error predicted
here from the model is very close to the mean duration of a multiple error, 175.5 seconds,
obtained from real data and listed in Table 4.3. This provides strong evidence that the
semi-Markov process is a good model for our measured system. As soon as an entry into a
multiple error is made, consecutive errors are detected almost every 31 seconds (obtained
by taking the reciprocal of the sum of all entry probabilities $e$ in Table 4.7). This
indicates that about 5 to 6 errors will be detected, on average, once a multiple error
occurs.
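
These figures can be checked directly from the entries of Table 4.7. The Python sketch
below reads the entry row of the table as rates per second, an interpretation inferred from
the fact that the reciprocal of their sum is quoted in seconds; the variable names are
illustrative only.

```python
# Entry row of Table 4.7, read here as entries per second (an inferred unit).
ENTRY_RATES = {"Normal": 0.00568, "CTRL": 0.00105, "DLCK": 0.00015, "I/O": 0.00466,
               "PE": 0.00473, "SE": 0.00950, "SM": 0.00437, "OTHR": 0.00198}

# Mean recurrence times (hours) quoted from Table 4.7.
RECURRENCE_HOURS = {"Normal": 0.0489, "SE": 0.0292}

seconds_between_errors = 1.0 / sum(ENTRY_RATES.values())          # about 31 s
multiple_error_duration = RECURRENCE_HOURS["Normal"] * 3600.0     # about 176 s
storage_exception_interval = RECURRENCE_HOURS["SE"] * 3600.0      # about 105 s
errors_per_event = multiple_error_duration / seconds_between_errors
```

The last quantity, the duration of a multiple error divided by the time between
consecutive errors, is what yields the estimate of 5 to 6 errors per multiple-error event.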


There are several interesting characteristics of multiple errors which can be derived
from the model of Figure 4.1. For example, if we want to know the probability of a
storage exception error given an I/O error, we can evaluate it by the multi-step transition
probability to the SE state from the I/O state. This turns out to be very small, only
0.0076. However, we find that the probability of an I/O error occurring given that an SE
error occurs, at any time instant, is as high as 0.668. This is partly due to the fact that
for a semi-Markov process the unconditional transition probability at any time instant,
$y_{ij}$, is not only a function of the conditional transition probability $p_{ij}$ but
also a function of the mean holding time. This can be seen in Equation 2.5.1.

4.5.  Conclusion 

In  this  study,  we  have  developed  a  semi-Markov  model  to  describe  the  error  and 
recovery  processes  in  the  MVS  system.  The  model  is  based  on  real  error  data  collected  dur¬ 
ing  normal  system  operation.  The  semi-Markov  model  developed  provides  a  quantification 
of  system  error  characteristics  and  the  interaction  between  different  types  of  errors.  As  an 
example,  we  provide  a  detailed  model  and  analysis  of  multiple  errors,  which  constitute 
approximately  17%  of  all  software  errors  and  result  in  considerable  overhead.  It  is  sug¬ 
gested  that  other  systems  be  similarly  analyzed  and  modeled  so  that  a  wide  range  of  realis¬ 
tic  models  of  software  reliability  in  an  operating  environment  are  available. 


CHAPTER  5 


PERFORMABILITY MODEL


A workload/reliability model has been built based on real data. Given a stochastic
transition probability matrix and a holding time density matrix, system behavior such as
the unconditional transition probability and state occupancy probability in the steady state
can be estimated. However, the performability of the measured system has not yet been
addressed. Thus in this chapter we use the resource-usage/error/recovery model to estimate
the performability of the system. Reward functions are used to depict the performance
degradation due to errors and also due to different types of recovery procedures. Toward
this end, we define a reward rate for each state of the resource-usage/error/recovery model.

5.1.  Reward  Function 

First, we propose the reward rate $r_i$ (per unit time) for each state $i$ in our model as
follows:

\[ r_i = \begin{cases} \dfrac{s_i}{s_i + e_i} & \text{if } i \in S_N \cup S_E, \\ 0 & \text{if } i \in S_R, \end{cases} \tag{5.1.1} \]

where $s_i$ and $e_i$ are the service rate and the error rate in state $i$, respectively.
Thus one unit of reward is given for each unit of time when the process stays in the normal
states $S_N$. The penalty paid depends on the number of errors generated by an error event.
With an increasing number of errors the penalty per unit time increases and, accordingly,
the reward rate decreases. Zero reward is assigned to recovery states, because during the
recovery process the system does not contribute any useful work toward the system
performance besides recovering from an error. Based on this proposal, reward rates for the
error states are estimated and shown in Table 5.1. We know from Table 3.3(b) that the
transition probability to the DASD error state is about twice that to the SWE error state,
and Table 5.1 shows that the reward gained in the DASD state is also about twice that
gained in the SWE state. Thus we expect that the impact of DASD errors on the
performability is much higher than that of SWE errors. In order to understand the effect of
various errors, we first show some important performability measures that can be derived
from the model.
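
A minimal Python sketch of this reward assignment follows. It assumes the reward rate has
the form $r_i = s_i / (s_i + e_i)$ for normal and error states and zero for recovery
states; the function name and the numeric rates are hypothetical and are not measured
values.

```python
def reward_rate(service_rate, error_rate, is_recovery_state=False):
    """Reward per unit time: s / (s + e) in normal and error states, 0 in recovery."""
    if is_recovery_state:
        return 0.0
    return service_rate / (service_rate + error_rate)

# Hypothetical rates, chosen only to show the behavior of the function.
normal_reward = reward_rate(service_rate=2.0, error_rate=0.0)    # no errors: full reward
light_error_reward = reward_rate(service_rate=1.0, error_rate=1.0)
heavy_error_reward = reward_rate(service_rate=1.0, error_rate=3.0)
recovery_reward = reward_rate(1.0, 1.0, is_recovery_state=True)
```

A state that generates no errors earns one unit of reward per unit time, and the reward
falls toward zero as the error rate grows relative to the service rate.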

Table 5.1. Reward rates, $r_i$, for error states (DASD, SWE, CHAN, MULT)

Since the system can be in any state at any instant, the reward rate of the system at
time $t$, $X(t)$, is the reward rate of the state that the system currently occupies. It is
a random variable, denoted as

\[ X(t) = r_i \quad \text{if the process is in state } i \text{ at time } t. \tag{5.1.2} \]


Therefore the expected reward rate at time $t$, $E[X(t)]$, can be evaluated as

\[ E[X(t)] = \sum_i p_i(t)\, r_i, \tag{5.1.3} \]

where $p_i(t)$ is the probability of the process being in state $i$ at time $t$. The
cumulative reward by time $t$, $Y(t)$, can be derived from

\[ Y(t) = \int_0^t X(\sigma)\, d\sigma. \tag{5.1.4} \]

Therefore, the expected cumulative reward at time $t$, $E[Y(t)]$, is given by [28]:

\[ E[Y(t)] = E\left[ \int_0^t X(\sigma)\, d\sigma \right] = \sum_i r_i \int_0^t p_i(\sigma)\, d\sigma. \tag{5.1.5} \]

In order to solve for $p_i(t)$ and hence other measures, we convert the semi-Markov process
into a Markov chain using the method of stages [26, 29]. The conversion of the semi-Markov
model to the Markov model for the measured system is described in Section 5.2. Thus the
state probability vector $P(t) = (\ldots, p_i(t), \ldots)$ can be computed by solving the
set of differential equations of the form:

\[ \frac{d}{dt} P(t) = P(t)\, Q, \]

where $Q$ is the transition rate matrix of the Markov chain [25].
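
As an illustration of how these measures can be computed, the Python sketch below
integrates $dP/dt = P(t)Q$ with a classical fourth-order Runge-Kutta step and accumulates
the expected reward with the trapezoidal rule. It is not the solution method of the tool
used in this study; the two-state chain, its rates and the function names are hypothetical.

```python
def vec_mat(p, Q):
    """Row vector times matrix: (p Q)_j = sum_i p_i Q_ij."""
    n = len(p)
    return [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]

def rk4_step(p, Q, h):
    """One classical Runge-Kutta step for dP/dt = P Q."""
    k1 = vec_mat(p, Q)
    k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], Q)
    k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], Q)
    k4 = vec_mat([x + h * k for x, k in zip(p, k3)], Q)
    return [x + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]

def expected_rewards(Q, rewards, p0, t_end, steps=2000):
    """Return (E[X(t_end)], E[Y(t_end)]) for a Markov reward model."""
    h = t_end / steps
    p = list(p0)
    rate = sum(pi * ri for pi, ri in zip(p, rewards))
    cumulative = 0.0
    for _ in range(steps):
        p = rk4_step(p, Q, h)
        new_rate = sum(pi * ri for pi, ri in zip(p, rewards))
        cumulative += 0.5 * h * (rate + new_rate)  # trapezoidal rule for the integral
        rate = new_rate
    return rate, cumulative

# Hypothetical two-state chain: state 0 = normal (reward 1), state 1 = recovery (reward 0).
a, b = 0.2, 1.0                      # assumed error rate and recovery rate
Q = [[-a, a], [b, -b]]               # each row of a rate matrix sums to zero
ex, ey = expected_rewards(Q, [1.0, 0.0], [1.0, 0.0], t_end=5.0)
```

For this two-state example the result can be compared with the closed-form solution
$p_0(t) = b/(a+b) + a/(a+b)\, e^{-(a+b)t}$, which gives a simple check on the integrator.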


5.2.  Semi-Markov  to  Markov  Conversion 

As we know, the sojourn time distribution of states is the only difference between
semi-Markov and Markov models. For a Markov model the sojourn time distributions of
states must be exponential; for a semi-Markov model they can be any distribution.
Thus, to convert a semi-Markov process to a Markov process, one must replace the
non-exponential distributions with exponentials. In this section, we show how to convert a
state with a non-exponential distribution to a number of states, each of which is
exponential.

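In Section 2.4 we fitted the state holding times of our resource-usage/error/recovery
process to phase-type exponentials. The phase-type exponential function $f(t)$ can be
expressed as

\[ f(t) = \sum_{i=1}^{n} a_i\, g_i(t), \]

where $a_i > 0$, $\sum_{i=1}^{n} a_i = 1$, and $n$ is the number of phases. For each phase
$i$, $g_i(t)$ can be a simple exponential, a multi-stage hyperexponential, or a multi-stage
hypoexponential. The definitions of these three types of functions are listed below.

Exponential: EXP($\lambda$)

\[ \mathrm{EXP}(\lambda) = \lambda e^{-\lambda t} \]

Hyperexponential: Hyper($\lambda_1, \lambda_2, \ldots, \lambda_r$)

\[ \mathrm{Hyper}(\lambda_1, \lambda_2, \ldots, \lambda_r) = \sum_{i=1}^{r} a_i\, \mathrm{EXP}(\lambda_i) = \sum_{i=1}^{r} a_i \lambda_i e^{-\lambda_i t}, \]

where $\lambda_i > 0$, $a_i \ge 0$, and $\sum_{i=1}^{r} a_i = 1$.

Hypoexponential: Hypo($\lambda_1, \lambda_2, \ldots, \lambda_r$)

\[ \mathrm{Hypo}(\lambda_1, \lambda_2, \ldots, \lambda_r) = \sum_{i=1}^{r} a_i\, \mathrm{EXP}(\lambda_i) = \sum_{i=1}^{r} a_i \lambda_i e^{-\lambda_i t}, \]

where $\lambda_i > 0$, $\lambda_i \ne \lambda_j$ if $i \ne j$, and
$a_i = \prod_{j=1,\, j \ne i}^{r} \dfrac{\lambda_j}{\lambda_j - \lambda_i}$.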

By using the method of stages [26], a hyperexponential distribution can be modeled as a set
of parallel exponential stages and a hypoexponential distribution as a set of series
exponential stages. Figures 5.1(a) and 5.1(b) show the conversions of these two types. In
Figure 5.1(a) we note that each $a_i$ of the hyperexponential function is converted to the
probability of entering the associated state with density EXP($\lambda_i$); however, this
is not the case for the hypoexponential. From Figure 5.1(b), we see that the Markov version
of the hypoexponential is just a series connection of states in which each state has a
simple exponential density function and the probability of moving from one state to the
next is one. As an example, we know from Section 2.3 that the holding time density from
state Ws to error state DASD is fitted by

\[ f(t) = 0.235\, \mathrm{EXP}(0.004) + 0.765\, \mathrm{Hypo}(0.00093, 0.006595), \tag{5.2.1} \]

which is a combination of hyper- and hypoexponentials. The Markov conversion of the
state with the holding time density of Equation 5.2.1 is shown in Figure 5.2. Note that the
state Ws in Figure 5.2(a) is modeled as a three-state Markov process. This is shown in the
dotted area of Figure 5.2(b).
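
The stage expansion of this example can be sketched in Python. The sketch assumes the
standard hypoexponential coefficients $a_i = \prod_{j \ne i} \lambda_j / (\lambda_j -
\lambda_i)$; the function names and the data layout of the stage list are illustrative
only, while the weights and rates are those of Equation 5.2.1.

```python
import math

def hypo_coefficients(rates):
    """a_i = prod over j != i of rate_j / (rate_j - rate_i); rates must be distinct."""
    coeffs = []
    for i, li in enumerate(rates):
        a = 1.0
        for j, lj in enumerate(rates):
            if j != i:
                a *= lj / (lj - li)
        coeffs.append(a)
    return coeffs

def exp_pdf(t, rate):
    """Density of EXP(rate) at time t."""
    return rate * math.exp(-rate * t)

def hypo_pdf(t, rates):
    """Density of exponential stages in series with the given rates."""
    return sum(a * exp_pdf(t, r) for a, r in zip(hypo_coefficients(rates), rates))

def holding_time_pdf(t):
    """Equation 5.2.1: 0.235 EXP(0.004) + 0.765 Hypo(0.00093, 0.006595)."""
    return 0.235 * exp_pdf(t, 0.004) + 0.765 * hypo_pdf(t, [0.00093, 0.006595])

# Markov expansion: (branch probability, rates of the exponential stages in series).
STAGES = [(0.235, [0.004]), (0.765, [0.00093, 0.006595])]
number_of_markov_states = sum(len(rates) for _, rates in STAGES)
mean_holding_time = sum(p * sum(1.0 / r for r in rates) for p, rates in STAGES)
```

The expansion has one state for the exponential branch and two states in series for the
hypoexponential branch, which agrees with the three-state process described for Figure
5.2(b).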


Figure 5.2. The Markov conversion of state Ws: (a) semi-Markov, (b) Markov


5.3. Performability Analysis

After converting a semi-Markov process to a Markov process, analysis can be carried
out on the resulting Markov reward model of the measured system using SHARPE (the
Symbolic Hierarchical Automated Reliability and Performance Evaluator) [29]. SHARPE is
a modeling tool developed at Duke University. It provides several model types ranging
from reliability block diagrams to complex semi-Markov models, and allows the user to
construct and analyze performance, reliability and availability models. However, this tool
can only be used to analyze a model with fewer than 200 states; thus we assume our
resource-usage/error/recovery model is an independent semi-Markov process. The conversion
of the waiting times of states is shown in Appendix C.

In  order  to  study  the  impact  of  different  types  of  errors,  the  irreducible  semi-Markov 
process  is  converted  to  one  with  absorbing  states  in  the  following  manner: 

a) with OFFL as the absorbing state (OFFL),

b) with MULT and OFFL as the absorbing states (MULT),

c) with SWE, MULT and OFFL as the absorbing states (SWE),

d) with DASD, MULT and OFFL as the absorbing states (DASD), and

e) with DASD, SWE, MULT and OFFL as the absorbing states (ALL).

In case (a) we assess system performability when only off-line failures are not
recovered from. This actually provides us with the system reliability. In case (b) we also
discontinue recovering from multiple errors. Here, we expect to measure the impact of a
multiple error on the reward. Since multiple errors happen much more frequently than OFFL
and their sojourn time is much longer than that of the others, we measure the impact of
SWE and DASD errors on the reward with MULT kept as an absorbing state. Thus, in case (c)
we not only stop recovering from multiple and off-line failures but we also stop the
recovery from software errors. In case (d) we recover from SWE errors but stop recovery
from DASD errors. Finally, in case (e) we do not recover from any errors besides CHAN.

We compare these scenarios first using the expected instantaneous reward rate E[X(t)],
which is defined by Equation 5.1.3, and then using the time-averaged expected accumulated
reward E[Y(t)]/t. In all cases except (a) and (e) we consider two variations: when a state
such as DASD (or MULT or SWE) is made absorbing, we can either let the reward rate in that
state be non-zero or set its reward rate to zero. The non-zero assignment means that, upon
reaching the absorbing state, the system continues to operate in a degraded mode. In the
latter case, i.e., the zero reward assignment, we conservatively assume that the system
stops functioning when it reaches the absorbing state(s).
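
To make the two measures and the two reward assignments concrete, the sketch below
evaluates them on a discrete-time Markov chain. This is a simplification made only for
illustration: the thesis model is semi-Markov and E[X(t)] is defined by Equation 5.1.3,
which is not reproduced here. The two-state chain, its transition probabilities and the
reward rate of 1.0 for NORMAL are assumptions; 0.5708 is the DASD reward rate quoted from
Table 5.1.

```python
def step(p, P):
    """Advance the state distribution by one step: p_next[j] = sum_i p[i] * P[i][j]."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def reward_measures(P, rewards, p0, steps):
    """Return (E[X(n)], E[Y(n)]/n) after `steps` steps, where X is the reward
    rate of the current state and Y is the reward accumulated over steps 0..n-1."""
    p = p0[:]
    accumulated = 0.0
    for _ in range(steps):
        accumulated += sum(pi * ri for pi, ri in zip(p, rewards))
        p = step(p, P)
    instantaneous = sum(pi * ri for pi, ri in zip(p, rewards))
    return instantaneous, accumulated / steps


# Hypothetical chain: state 0 = NORMAL, state 1 = DASD (made absorbing).
P = [[0.9, 0.1],
     [0.0, 1.0]]
p0 = [1.0, 0.0]

degraded = reward_measures(P, [1.0, 0.5708], p0, steps=10)  # non-zero reward in DASD
failed = reward_measures(P, [1.0, 0.0], p0, steps=10)       # zero reward in DASD
```

Under the zero assignment both measures decay toward zero as probability mass collects in
the absorbing state; under the non-zero assignment they level off at the degraded rate.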

In Figure 5.3(a), we plot E[X(t)] for cases (a) and (b). In case (b) we use two different
assumptions for the reward rate of the MULT state: a non-zero value of r_MULT and
r_MULT = 0. We also plot E[X(t)] for case (a) under the assumption that all states have
exponentially distributed holding times. We note that such a Markovian assumption leads
to an overestimation of the system's capability to perform useful work, and the degree of
overestimation increases as the system operating time increases. We also note that not
recovering from multiple errors considerably degrades the system's performability.
Moreover, changing from a non-zero to a zero reward rate reduces the system's effectiveness
drastically further.

In Figure 5.3(b), we plot E[X(t)] for cases (c), (d) and (e). In each case, except case (e),
we again have two versions, with the reward rates for the absorbing states being non-zero
and zero, respectively. Note that not recovering from SWE errors degrades system
effectiveness considerably more than not recovering from DASD errors, provided we assume
that absorbing states continue to provide service in a degraded mode. On the other hand, if
we assume that absorbing states are system failure states, i.e., zero reward rates for
absorbing states, then not recovering from DASD failures is more severe than not recovering
from SWE failures. This behavior is explained by the fact that the reward rate in the DASD
state is about twice that in the SWE state (0.5708 versus 0.2736 in Table 5.1). Figures
5.4(a) and 5.4(b) are the counterparts of Figures 5.3(a) and 5.3(b), where the measure
plotted is E[Y(t)]/t rather than E[X(t)]. The trends are similar.

Finally, in Figure 5.5, we show the distribution function of Y(∞), the accumulated
reward until system failure, for two cases: Markov versus semi-Markov. Both assume that
the OFFL state is the only absorbing state. Once again we note that the Markovian
assumption implies an overestimation of the system's performability.
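
As a rough illustration of the accumulated reward until system failure, the sketch below
computes only its expected value, and only for a discrete-time Markov chain; it does not
reproduce the distribution shown in Figure 5.5 or the semi-Markov analysis. It assumes
that absorption is certain and that no reward is earned once the absorbing state is
reached. The function name and the example numbers are hypothetical.

```python
def expected_reward_until_absorption(Q, rewards, tol=1e-12, max_iter=100000):
    """Solve v = rewards + Q v by fixed-point iteration.

    Q holds the transition probabilities among the transient states only
    (rows sum to less than 1 because some mass leaks to the absorbing state).
    v[i] is the expected reward accumulated before absorption when starting in i.
    """
    n = len(Q)
    v = [0.0] * n
    for _ in range(max_iter):
        new_v = [rewards[i] + sum(Q[i][j] * v[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    raise RuntimeError("did not converge; check that absorption is certain")


# Hypothetical transient states: 0 = NORMAL (reward 1.0), 1 = DASD (reward 0.5708).
# The remaining probability in each row leads to the absorbing OFFL state.
Q = [[0.80, 0.15],
     [0.90, 0.00]]
rewards = [1.0, 0.5708]
v = expected_reward_until_absorption(Q, rewards)
```

Comparing such a value across models with different holding-time assumptions is one simple
way to see how strongly the choice of model affects a performability estimate.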


Figure 5.5. Distribution of accumulated reward until system failure


CHAPTER 6


SUMMARY  AND  CONCLUSIONS 


6.1.  Summary  of  Results 

This thesis has developed a methodology to construct a
resource-usage/reliability/performability model for a complex system based on real data.
The model obtained is capable of reflecting both the normal and error behavior of the
system. Both hardware and software reliability and their interactions are modeled. The
effect of recovery through the built-in recovery mechanisms is also considered. By modeling
the recovery process we are able to evaluate the severity of errors in general and the cost
of specific error types in particular. The low-level error and resource-usage data used to
develop the model were collected on an IBM 3081 machine running the MVS operating system.
The results of this research suggest that other production systems should be similarly
analyzed so that a body of realistic data on computer error (including failure) and
recovery models becomes available.

Chapter 2 described the development of the model, using the low-level data on resource
usage and errors. A statistical clustering method (k-means clustering) was employed to
characterize the resource usage into a few workload clusters. A two-level error data
reduction scheme (error coalescing and grouping) was used to identify individual error
incidents. Results showed that about 17% of errors are multiple errors (believed to be
multiple manifestations of the same problem). The state-transition diagram for a multiple
error was obtained to study the interaction between system components (hardware and
software). For example, it was seen that software and disk errors were strongly correlated.
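
For readers unfamiliar with k-means clustering, the following is a minimal sketch of the
basic algorithm applied to two-dimensional resource-usage samples. It is not the
implementation used in Chapter 2. The sample points (imagined here as a CPU fraction and an
I/O fraction), the initial centroids and the number of clusters are all hypothetical.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means: alternate nearest-centroid assignment and centroid update.

    Returns the final centroids and the clusters from the last assignment step.
    """
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the centroid with the smallest squared distance.
            nearest = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical samples: (CPU fraction, I/O fraction) per measurement interval.
samples = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
           (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
centroids, clusters = kmeans(samples, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

Each resulting centroid would play the role of one workload state in the state-transition
model.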

From the measurement data it was seen that the holding times in key operational and error
states were not simple exponentials. A semi-Markov process was therefore used to model the
system behavior. This semi-Markov assumption was also validated by comparing the state
occupancy probabilities predicted by the model with the actual state occupancy
probabilities estimated from the observed data. The results showed that the proposed model
provided a fairly accurate prediction of the real behavior.
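
One standard way to predict long-run state occupancy probabilities for a semi-Markov
process is to weight the stationary distribution of the embedded chain by the mean holding
times. The sketch below illustrates that calculation; whether the thesis used exactly this
computation is an assumption. The three-state chain (W, ERR, REC), its transition
probabilities and the mean holding times are hypothetical.

```python
def stationary(P, iterations=500):
    """Stationary distribution of the embedded chain.

    Each pass averages the current distribution with its one-step image, which
    has the same fixed point as P but also converges for periodic chains.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        moved = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        pi = [0.5 * (a + b) for a, b in zip(pi, moved)]
    return pi


def occupancy(P, mean_holding):
    """Long-run fraction of time in each state: pi_i * m_i / sum_j pi_j * m_j."""
    pi = stationary(P)
    weights = [p * m for p, m in zip(pi, mean_holding)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical embedded chain: W -> ERR; ERR -> W (0.2) or REC (0.8); REC -> W.
P = [[0.0, 1.0, 0.0],
     [0.2, 0.0, 0.8],
     [1.0, 0.0, 0.0]]
mean_holding = [100.0, 1.0, 2.0]   # hypothetical mean holding times
predicted = occupancy(P, mean_holding)
```

The predicted vector can then be compared state by state with the occupancy fractions
estimated directly from the observed data.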

The analysis of model behavior was performed in Chapter 3. The analysis showed that
on-line recovery is highly effective and provides the system with the ability to tolerate
many faults and recover almost instantaneously. An analysis to extract the effect of the
workload on the error probability showed that not only does a higher workload result in a
higher error probability (for similar holding times), but the error probability also
increases with increased holding time in a particular workload state. In other words, the
error probability appears to be a function of the absolute amount of resource consumed, be
it through increased workload, increased holding times, or both. An explanation for this
"wear-out" phenomenon is not clear, since a large majority of the collected errors are
transient, but it certainly calls into question the validity of the constant error
probability assumption frequently used in reliability modeling.

The significance of using a semi-Markov model, as opposed to a simple Markov model, to
describe the overall resource-usage/error/recovery process was also investigated. The
results showed that a simple Markov model frequently overestimates the unconditional
transition probabilities and underestimates the variance of the first passage times to the
error states. The overestimation can lead to an unduly conservative reliability prediction,
and the underestimation may lead to an unduly optimistic reliability prediction. Neither
over- nor underestimation is desirable.

In Chapter 4, the software error data were used to build a software reliability model to
describe the error and recovery processes in the MVS operating system. The semi-Markov
model developed provided a quantification of the operating system error characteristics and
also of the interaction between different types of OS errors. We estimate that the measured
software system is unable to recover in only 0.5% of the cases. A detailed model and
analysis of multiple software errors (which constitute approximately 17% of all software
errors) was provided, showing how a single software problem can have multiple
manifestations. To investigate the validity of this model, the duration of a multiple error
predicted from the model was compared with the value estimated from the observed data. The
two results were found to agree to within 1%.

A measurement-based performability model was discussed in Chapter 5. A reward function,
based on the service rate and the error rate in each state, was proposed. In order to
investigate the impact of different errors, the expected reward rate at time t, as well as
the cumulative reward up to time t, was estimated. The results show that the software error
(SWE) degrades the system performance more severely than the disk error (DASD), although
the probability of DASD errors is about twice that of SWE errors (0.169 versus 0.085). This
may be because the cost of a DASD error is less than that of a SWE error, i.e., the reward
rate in the DASD state is higher than that in the SWE state. If, however, both error types
result in system failure then, as expected, the DASD error degrades the system performance
more severely than the SWE error.


The system performability under a Markov assumption was also estimated and compared with
that estimated from the more realistic semi-Markov model. It was found that the Markov
assumption overestimates the system performability and that the degree of overestimation
increases with system operating time. Once again, this indicates that the traditional
Markov process is not adequate to model a computer system and to provide accurate
predictions.

6.2.  Suggestions  for  Future  Research 

The results of this study suggest that other systems be similarly studied so that a wide
body of realistic results on computer system hardware and software performability becomes
available. This is useful both for validating existing analytical models and for generating
realistic models of system behavior.

A possible extension is the area of adaptive model construction. The workload and error
clustering methods employed here have potential for use in an adaptive algorithm capable of
real-time model construction. The use of such models for adaptive tuning for optimum
performability under various conditions needs to be investigated. To be successful, such a
system would require learning capabilities so as to use valid past information, together
with some knowledge of the environment, both for reconfiguration under failure and for
system tuning.

In this thesis we have used past data on errors and workload for model construction. It
would be interesting to investigate the possibility of doing the same on the basis of data
generated from error/failure injection on a prototype or into a simulation model of a
system. Such a procedure has the potential of providing realistic feedback to system
designers early in the development stage. A comparison of the results from such a model
with those obtained through analytical models would be instructive as well.


APPENDIX A


(I).  Error  Clustering 

Identical  errors  occurring  within  5  minutes  of  each  other  were  coalesced  into  a  single 
event.  This  was  done  to  ensure  that  the  analysis  is  not  biased  by  failure  records  relating  to 
the  same  problem.  The  clustering  algorithm  analyzes  the  data  and  merges  observations 
which occur in rapid succession and relate to the same problem. For each failure point, the
following test was performed:

IF <error type> = <type of previous error> AND
   <time away from previous error> <= 5 minutes
THEN
    <fold error into cluster being built>
ELSE
    <start a new cluster>

The result is a set of clustered errors. Associated with each [...]ing of error
classifications, the number of points in the [...] the cluster, and a variety of status
data [...].
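
The coalescing test above can be sketched in a few lines. This is an illustration under
simplifying assumptions, not the program used for the study: error records are reduced to
a timestamp and an error type (the machine state is left out), the input is assumed to be
sorted by time, and the record layout and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # coalescing interval stated in the text


def coalesce(errors, window=WINDOW):
    """Fold each error into the cluster being built when it has the same type as
    the previous error and occurs within `window` of it; otherwise start a new
    cluster. `errors` is a time-sorted list of (timestamp, error_type)."""
    clusters = []
    for ts, etype in errors:
        last = clusters[-1] if clusters else None
        if last is not None and last["type"] == etype and ts - last["end"] <= window:
            last["end"] = ts          # the cluster now extends to this error
            last["count"] += 1
        else:
            clusters.append({"type": etype, "start": ts, "end": ts, "count": 1})
    return clusters


# Hypothetical error log.
log = [
    (datetime(1985, 1, 1, 10, 0), "DASD"),
    (datetime(1985, 1, 1, 10, 3), "DASD"),
    (datetime(1985, 1, 1, 10, 7), "DASD"),
    (datetime(1985, 1, 1, 10, 8), "SWE"),
    (datetime(1985, 1, 1, 10, 20), "DASD"),
]
clustered = coalesce(log)
```

Because each error is compared with the previous error rather than with the start of the
cluster, a cluster can span more than 5 minutes as long as no gap exceeds the window.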

(II). Error Grouping

[...] rate introduces the suspicion that the errors occurring during the high error rate
period may be related, i.e., different errors may be due to a single cause, to multiple but
related causes, or to multiple and independent causes. Therefore, the high error rate
periods are formed by grouping all error clusters occurring within a small time interval of
each other. This interval was chosen to be 5 minutes. The result is a set of grouped
errors. The primary difference between a cluster and a group is that clusters contain only
occurrences of the same error (same error type and machine state), whereas groups contain
occurrences of different errors (different error type or machine state).
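
The grouping step can be sketched in the same spirit. The code below is an illustration
only: clusters are represented as (start, end, error type) with times in minutes, the input
is assumed to be sorted by start time, and both the representation and the sample values
are hypothetical.

```python
def group_clusters(clusters, window=5):
    """Merge consecutive error clusters into one group whenever a cluster starts
    within `window` minutes of the end of the group built so far, regardless of
    error type. `clusters` is a list of (start, end, error_type) sorted by start."""
    groups = []
    group_end = None
    for start, end, etype in clusters:
        if groups and start - group_end <= window:
            groups[-1].append((start, end, etype))
            group_end = max(group_end, end)
        else:
            groups.append([(start, end, etype)])
            group_end = end
    return groups


# Hypothetical clusters, times in minutes from an arbitrary origin.
example_clusters = [(0, 7, "DASD"), (8, 8, "SWE"), (20, 20, "DASD")]
grouped = group_clusters(example_clusters)
```

In this example the DASD and SWE clusters fall into one group, which is the kind of
incident the text treats as a possible multiple error, while the later DASD cluster forms a
group of its own.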

APPENDIX B


The Characteristics of the Resource-Usage/Error/Recovery Model


(I).  Stochastic  transition  probability  matrix. 

(Due  to  the  size  of  the  matrix,  it  is  broken  into  two  parts,  (a)  and  (b).) 
Waiting and holding time densities. Constant sojourn times are assigned in W0, W2, CPU,
CHAN and recovery states. The others are shown below.


Waiting/holding time densities for W4


Waiting  time  density  of  Ws 


Waiting  time  density  of  Ws 


Waiting/holding  time  densities  for  W6 


APPENDIX  C 


Semi-Markov  to  Markov  Conversion 


The state conversion of the resource-usage/error/recovery process from a semi-Markov
model to a Markov model is demonstrated in this appendix. Here, we assume that the model
is an independent semi-Markov process because of a limitation of SHARPE.

The CPU-bound workload state is used to estimate the system's performability. State W2 is
combined with W3 because W2 has very few observations. The CHAN error state is also ignored
because it has very few observations. The semi-Markov to Markov conversions of the
workload states are shown in Figures (a) through (g), and the conversions of the three
error states are shown in Figures (h) through (j). After the conversion, the overall model
is expanded from 15 states to 34 states.